
[MRG] Adding SAX+MINDIST to KNN #152

Merged: 49 commits merged into dev on Mar 16, 2020

Conversation

@GillesVandewiele (Contributor)

This PR contains the following changes:

  • 'sax' is now a valid metric for KNN:

from tslearn.neighbors import KNeighborsTimeSeriesClassifier

knn = KNeighborsTimeSeriesClassifier(n_neighbors=1, metric='sax')

  • Added BaseEstimator to the classes in the preprocessing module so that they can be used within a Pipeline (errors were raised when using TimeSeriesScalerMeanVariance).

  • Fixed a bug in the kneighbors method, which would always return [0] as the nearest neighbor for every sample:

knn = KNeighborsTimeSeriesClassifier(n_neighbors=1, metric='dtw')
knn.fit(X_train, y_train)
_, ind = knn.kneighbors(X_test)
# ind would be filled with 0's

  • Slightly changed the code of kneighbors so that its result is consistent with sklearn's. There was a small difference in tie breaking (tslearn would pick the largest index while sklearn picks the smallest). The following two snippets are now equivalent:

knn = KNeighborsTimeSeriesClassifier(n_neighbors=1, metric='dtw')
knn.fit(X_train, y_train)
_, ind = knn.kneighbors(X_test)

import numpy
from sklearn.metrics import pairwise_distances
from tslearn.metrics import dtw

knn = KNeighborsTimeSeriesClassifier(n_neighbors=1, metric='precomputed')
all_X = numpy.vstack((X_train, X_test))  # assumes equal-length univariate series
distances = pairwise_distances(all_X, metric=dtw)
X_train = distances[:len(X_train), :len(X_train)]
X_test = distances[len(X_train):, :len(X_train)]
knn.fit(X_train, y_train)
_, ind = knn.kneighbors(X_test)

# both ind vectors are now equal (that was not the case before this PR)

Some remarks:

  • I am inexperienced with numba; adding an njit decorator to cdist_sax did not work immediately, so I could perhaps use some help with that.
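For reference, here is a minimal njit-compatible sketch of what such a cdist_sax could look like, assuming SAX words are stored as 2-d integer arrays, breakpoints is the Gaussian breakpoint vector for the chosen alphabet, and n is the original series length; the name and signature are illustrative, not the code in this PR:

import numpy as np
from numba import njit, prange

@njit(parallel=True)
def cdist_sax(sax1, sax2, breakpoints, n):
    # sax1, sax2: (n_ts, w) integer symbol arrays; n: original series length
    a = breakpoints.shape[0] + 1  # alphabet size
    # Precompute the symbol-to-symbol lookup table (MINDIST cell function):
    # dist(r, c) = 0 if |r - c| <= 1, else breakpoints[max - 1] - breakpoints[min]
    table = np.zeros((a, a))
    for r in range(a):
        for c in range(r + 2, a):
            table[r, c] = breakpoints[c - 1] - breakpoints[r]
            table[c, r] = table[r, c]
    w = sax1.shape[1]  # number of segments
    dist = np.empty((sax1.shape[0], sax2.shape[0]))
    for i in prange(sax1.shape[0]):
        for j in range(sax2.shape[0]):
            s = 0.0
            for k in range(w):
                d = table[sax1[i, k], sax2[j, k]]
                s += d * d
            dist[i, j] = np.sqrt(n / w) * np.sqrt(s)
    return dist

Building the lookup table once up front keeps the inner loop to plain array indexing and arithmetic, which is the kind of code numba compiles without trouble.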

@pep8speaks

pep8speaks commented Sep 29, 2019

Hello @GillesVandewiele! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-03-16 15:28:59 UTC

@rtavenar changed the base branch from master to dev on September 29, 2019 at 10:42
@GillesVandewiele (Contributor, Author) commented Sep 29, 2019

It may also be important to note that the technique doesn't really work that well. The results are quite consistent with the paper, but I thought SAX would be insanely fast, which currently turns out not to be the case:

EDIT: I made a mistake in the code when printing the times for SAX; the updated results are below (I'll commit the fix soon).

| dataset          | sax error | sax time | eucl error | eucl time |
|------------------|----------:|---------:|-----------:|----------:|
| SyntheticControl | 0.02667   | 9.27606  | 0.12       | 0.02993   |
| GunPoint         | 0.20667   | 1.21585  | 0.08667    | 0.01001   |
| OSULeaf          | 0.47521   | 15.51386 | 0.47934    | 0.03264   |
| Trace            | 0.52      | 2.98893  | 0.24       | 0.01124   |
| FaceFour         | 0.14773   | 0.58396  | 0.21591    | 0.006     |
| Lightning2       | 0.21311   | 1.85655  | 0.2459     | 0.00766   |
| Lightning7       | 0.46575   | 1.31525  | 0.42466    | 0.0096    |
| ECG200           | 0.12      | 0.96109  | 0.12       | 0.00895   |
| Fish             | 0.50286   | 7.46496  | 0.21714    | 0.02598   |
| Plane            | 0.04762   | 1.68453  | 0.0381     | 0.01058   |
| Car              | 0.35      | 1.87007  | 0.26667    | 0.0078    |
| Beef             | 0.53333   | 0.2134   | 0.33333    | 0.00438   |
| Coffee           | 0.46429   | 0.22361  | 0.0        | 0.00569   |
| OliveOil         | 0.83333   | 0.52508  | 0.13333    | 0.00589   |

@rtavenar (Member)

Great, @GillesVandewiele!

I will try to review this PR ASAP, probably early next week. I am still unsure whether we should aim to get a fast SAX+MINDIST into this PR or delay that to a later one.

@GillesVandewiele (Contributor, Author) commented Sep 30, 2019

Hi @rtavenar, there's no rush to review, I'd say. I still need to fix Travis etc., so reviewing once that passes would be more appropriate.

EDIT: Another TODO is updating the docs + tests (mostly putting them here for myself so I don't forget ;))

@GillesVandewiele changed the title from "Adding SAX+MINDIST to KNN" to "[WIP] Adding SAX+MINDIST to KNN" on Sep 30, 2019
@GillesVandewiele (Contributor, Author)

Hi Romain,

Seems like Travis is now building successfully. I did not add any tests yet, though, as it doesn't really fit in the test_neighbors.py script currently (its output will not be equal to that of the Euclidean KNN). Any suggestions on what type of tests would be interesting to add?

Kind regards,
Gilles

@rtavenar (Member) commented Oct 3, 2019

That looks great, @GillesVandewiele!

Maybe a minimal test would be to check that the distances returned by kNN's kneighbors indeed lower-bound L2 (I think they should, if I remember the original paper correctly, but it would be worth checking :))

Hope this helps
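A rough sketch of such a test, assuming the new 'sax' metric accepts n_segments via metric_params; the dataset and parameter values are illustrative:

import numpy as np
from tslearn.generators import random_walks
from tslearn.preprocessing import TimeSeriesScalerMeanVariance
from tslearn.neighbors import KNeighborsTimeSeriesClassifier

# z-normalized random walks with two dummy classes
X = TimeSeriesScalerMeanVariance().fit_transform(random_walks(n_ts=20, sz=64))
y = np.array([0, 1] * 10)

knn = KNeighborsTimeSeriesClassifier(n_neighbors=3, metric='sax',
                                     metric_params={'n_segments': 8})
knn.fit(X, y)
dist_sax, ind = knn.kneighbors(X)

# MINDIST should never exceed the true Euclidean distance (lower bound)
for i in range(len(X)):
    for d, j in zip(dist_sax[i], ind[i]):
        eucl = np.linalg.norm(X[i].ravel() - X[int(j)].ravel())
        assert d <= eucl + 1e-9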

@codecov-io commented Oct 4, 2019

Codecov Report

Merging #152 into dev will increase coverage by 0.17%.
The diff coverage is 93.75%.


@@            Coverage Diff             @@
##              dev     #152      +/-   ##
==========================================
+ Coverage   93.62%   93.79%   +0.17%     
==========================================
  Files          25       25              
  Lines        3404     3419      +15     
==========================================
+ Hits         3187     3207      +20     
+ Misses        217      212       -5
| Impacted Files                       | Coverage Δ             |
|--------------------------------------|------------------------|
| tslearn/utils.py                     | 93.41% <100%> (-0.95%) ⬇️ |
| tslearn/clustering.py                | 93.33% <100%> (ø) ⬆️      |
| tslearn/piecewise.py                 | 97.56% <100%> (+0.03%) ⬆️ |
| tslearn/tests/test_neighbors.py      | 100% <100%> (ø) ⬆️        |
| tslearn/shapelets.py                 | 94.57% <100%> (-0.03%) ⬇️ |
| tslearn/tests/test_variablelength.py | 100% <100%> (ø) ⬆️        |
| tslearn/svm.py                       | 98.43% <100%> (ø) ⬆️      |
| tslearn/metrics.py                   | 97.19% <100%> (+0.01%) ⬆️ |
| tslearn/neighbors.py                 | 87.04% <88.4%> (+6.03%) ⬆️ |
| tslearn/preprocessing.py             | 93.24% <93.33%> (+1.3%) ⬆️ |
| ... and 4 more                       |                        |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2c2ac9a...37cebe0.

@GillesVandewiele (Contributor, Author)

Hi Romain, I added the test. I'm not sure how this codecov report is actually generated; it's weird that it is reporting about barycenters.py in this PR...

There's also one other point that may be worth discussing. Currently, I always perform a SAX transformation on the provided input data, but this assumes that the user has not already done so. Moreover, the current SAX transformation will fail pretty badly if the data is not standard-normalized. So maybe it would be better to change the metric to mindist and require the user to standard-normalize and SAX-transform the data themselves?

@rtavenar (Member) commented Oct 4, 2019

@GillesVandewiele

This is a very good question. I do not think we can use Pipelines here, because we need parameters of the previous step for the distance computation, so I guess the way you did it is reasonable; at least I do not see a more straightforward approach.

Maybe having a Pipeline (standardize + SAX kNN) in your gallery example (together with a comment saying that standardizing your data beforehand is good practice for SAX) would be a good idea. And a note in the kNN docs (under the metric param) as well.

Let me know when I should start my review; I could probably do it next week if you feel your code is in a sufficiently advanced state.

Have a nice weekend,
Romain
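Such a gallery example could look roughly like the following sketch, assuming the 'sax' metric from this PR; the parameter values are illustrative:

from sklearn.pipeline import Pipeline
from tslearn.datasets import CachedDatasets
from tslearn.preprocessing import TimeSeriesScalerMeanVariance
from tslearn.neighbors import KNeighborsTimeSeriesClassifier

X_train, y_train, X_test, y_test = CachedDatasets().load_dataset("Trace")

# Standardizing beforehand is good practice for SAX: MINDIST assumes
# z-normalized (roughly standard-normal) series
sax_knn = Pipeline([
    ('scaler', TimeSeriesScalerMeanVariance()),
    ('knn', KNeighborsTimeSeriesClassifier(n_neighbors=1, metric='sax',
                                           metric_params={'n_segments': 10}))
])
sax_knn.fit(X_train, y_train)
print(sax_knn.score(X_test, y_test))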

@GillesVandewiele (Contributor, Author)

Two other options would be the following:

  1. Always standard-normalize the input we get. As far as I know, standard-normalization is idempotent (so normalizing data that is already normalized will not really have any effect). Moreover, we do the normalization per time series individually, which makes this a lot easier as well.

  2. Perform a hypothesis test on (a sample of) the input data and raise a warning, or perform normalization, when the test indicates that the data isn't normally distributed (see the sketch after this comment).

What do you want me to do about the codecov, by the way?
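A rough sketch of what option 2 could look like, assuming scipy is available; the helper name warn_if_not_normalized, the sample size, and the significance level are all illustrative:

import warnings
import numpy as np
from scipy.stats import normaltest

def warn_if_not_normalized(X, alpha=0.05, n_check=10):
    """Hypothesis-test a sample of series and warn if they look non-normal."""
    rng = np.random.default_rng(0)
    idx = rng.choice(len(X), size=min(n_check, len(X)), replace=False)
    for i in idx:
        values = X[i][~np.isnan(X[i])]  # ignore padding in variable-length datasets
        _, p = normaltest(values)
        if p < alpha:  # normality rejected: data probably not standardized
            warnings.warn("Input series do not appear to be standard-normalized; "
                          "SAX assumes (roughly) standard-normal values.")
            return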

@rtavenar (Member) commented Oct 5, 2019

Hi,

I would be in favor of option 1. However,

  1. this should be documented;
  2. normalization should be performed dataset-wise, not per time series, in this case, I think.

Concerning codecov, I can have a look when I do my review.

@GillesVandewiele (Contributor, Author)

Forgot to mention it yesterday, but I think the code is now ready to be reviewed whenever you have time, @rtavenar. I refactored quite a lot of code in neighbors.py, so we definitely need to be careful there.

@rtavenar (Member)

I did not find time to review it this week, sorry. I hope I can do it next week.

@GillesVandewiele (Contributor, Author)

I handled most of the feedback and left some comments where needed.

@johannfaouzi (Contributor)

One test is failing for KNN with SAX. My guess would be that

from tslearn.utils import to_time_series_dataset

X = to_time_series_dataset([[1, 2, 3, 4],
                            [1, 2, 3],
                            [2, 5, 6, 7, 8, 9],
                            [3, 5, 6, 7, 8]])

is transformed into SAX with two segments

clf = KNeighborsTimeSeriesClassifier(metric="sax", n_neighbors=1,
                                     metric_params={'n_segments': 2})

and all the new time series are equal to [0, 1]; when you then perform classification with 1NN, you may get the class of the first sample (np.argmin returns the first index).

Could you try to print the SAX transformation for this dataset? You could also try to make the time series for class 1 decreasing instead of increasing; that should fix it.
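For reference, one quick way to inspect that transformation is to apply the SAX transformer from tslearn.piecewise directly; the alphabet size below is an illustrative guess, and the variable-length input assumes the support added in this PR:

from tslearn.piecewise import SymbolicAggregateApproximation
from tslearn.utils import to_time_series_dataset

X = to_time_series_dataset([[1, 2, 3, 4],
                            [1, 2, 3],
                            [2, 5, 6, 7, 8, 9],
                            [3, 5, 6, 7, 8]])
sax = SymbolicAggregateApproximation(n_segments=2, alphabet_size_avg=2)
print(sax.fit_transform(X))  # check whether every word collapses to [0, 1]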

@GillesVandewiele (Contributor, Author)

> One test is failing for KNN with SAX. My guess would be that […] you could also try to make the time series for class 1 decreasing instead of increasing; that should fix it.

Thanks Johann! Let me see if I can fix it :)

@johannfaouzi (Contributor) left a review comment

LGTM, thanks Gilles! @rtavenar

@GillesVandewiele (Contributor, Author)

> LGTM, thanks Gilles! @rtavenar

No, thank you for the great feedback once again! It's always a great learning experience to do a PR here. Next up should be some improvements for the ShapeletModel; I hope I can do those faster than this one ;)

@rtavenar (Member)

This looks good to me too. I had one simple comment regarding the documentation, and my last remark is that you can document your changes in the changelog before the merge.

Great job, once again, thanks @GillesVandewiele !

@GillesVandewiele (Contributor, Author) commented Mar 16, 2020

Hi Romain,

Thanks! Where/how do you want me to add it to the changelog? Do I just add the following on top?

## [v0.4.0]

### Changed
* TimeSeriesScalerMeanVariance and TimeSeriesScalerMinMax are now completely sklearn-compliant
* Bugfix in kneighbors() methods.

### Added
* Nearest Neighbors on SAX representation (with custom distance)
* Calculate pairwise distance matrix between SAX representations
* PiecewiseAggregateApproximation can now handle variable lengths

@rtavenar (Member)

Yes, you can create a `## [Towards v0.4.0]` section and describe your changes there.

@GillesVandewiele (Contributor, Author)

Alright, done! :)

@rtavenar changed the title from "[WIP] Adding SAX+MINDIST to KNN" to "[MRG] Adding SAX+MINDIST to KNN" on Mar 16, 2020
@rtavenar merged commit a5f66e1 into tslearn-team:dev on Mar 16, 2020
@rtavenar mentioned this pull request on Mar 29, 2020