Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH, BUG] Distance refactor and bug fix #2268

Merged
merged 151 commits into from Apr 13, 2022
Merged

[ENH, BUG] Distance refactor and bug fix #2268

merged 151 commits into from Apr 13, 2022

Conversation

TonyBagnall
Copy link
Contributor

@TonyBagnall TonyBagnall commented Mar 21, 2022

A range of internal changes to distances module and how they are used. The primary change is in the internal representation. From the beginning, we assumed an instance was of shape (m,d), where m is the series length and d is the number of dimensions. The reason for this was because thats how others did it, and there were some native python operations that were more efficient. However, it was later decided to use and (n,d,m) throughout sktime for numpy representations (n is number of cases) . This meant any estimator using the distances would need to transform internally, since it is not possible to know if data is passed in (d,m) or (m,d) shape. This has caused errors in code (e.g. KNN was broken for a while in 2021 because the transform was removed) and its generally far too brittle. This PR converts all distances to expect input series to be shape (d,m). Given it is all compiled to C, there will be no performance hit, its all just loops in the end. This means we can remove all transposing from distance clusterers and classifiers.

This PR has got a bit too big, apologies. A few other things includes

  1. Classifiers and clusterers using difference series now have option to taking the difference just once. direct calls to ddtw and wddtw still take the difference of the series, but kmeans takes the difference once and switches from ddtw/wddtw to dtw/wdtw
    This has been done by introducing another function average_of_slope_transform in ddtw rather than as a transformer. We could make it a transformer if thought desirable
  2. Add doc strings to describe distances. Let me know if any of these are unclear. I could have added more mathematical description for some. However, I have written the maths in sphinx, but I currently have no way of locally debugging it, so not sure what it looks like. I've put references for each.
  3. Add correctness tests for distances. These should be restructured to conform to standard parameterised approach, but would rather raise an issue and do that after the thing is in, would be a good first issue
  4. Single problem loads unit_test and basic_motions now have option to return numpy, but no other changes in the problem loaders, so hope this is not controversial
  5. fixes a bug in EDR and ERP, both of which were not using the correct data in calculations due to a typo in a nested local function
  6. fixed a bug in numba conversion that was still going to (m,d) on edge cases
  7. fixed pairwise_distance to remove transpose
  8. remove all transposing from distance based classifiers and clusterers

@TonyBagnall
Copy link
Contributor Author

note to self, wait until after Franz does pydata thing to merge this

@TonyBagnall TonyBagnall merged commit 84852f6 into main Apr 13, 2022
@TonyBagnall TonyBagnall deleted the distance_refactor branch April 13, 2022 17:54
@fkiraly
Copy link
Collaborator

fkiraly commented Apr 13, 2022

note to self, wait until after Franz does pydata thing to merge this

Indeed, the pydata thing is done now 😄

srggrs added a commit to Gridsight/sktime that referenced this pull request Apr 19, 2022
* upstream/main:
  [ENH] more forecaster scenarios for testing: using `X` (sktime#2462)
  less uncertainty samples (sktime#2497)
  [ENH] remove error message on exogeneous X from DirRec reducer (sktime#2463)
  [BUG] fix accidental overwrite of default method/arg sequences in test scenarios (sktime#2457)
  changed references to fit-in-transform to fit_is_empty (sktime#2494)
  [MNT] loosen strict upper bound on `scipy` to 1.9.0 (sktime#2474)
  [DOC] Added clustering module to API docs (sktime#2429)
  [ENH] Add prediction intervals for `UnobservedComponets` forecaster (sktime#2454)
  [BUG] remove `alpha` arg from `_boxcox`, remove private method dependencies, ensure scipy 1.8.0 compatibility (sktime#2468)
  [BUG] removed metric integration notebook (sktime#2476)
  [ENH] refactor `_predict_moving_cutoff` and bugfix, outer `update_predict_single` should be called (sktime#2466)
  [ENH, BUG] Distance refactor and bug fix (sktime#2268)
  [ENH] add new argument return_tags to all_estimators (sktime#2410)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
module:distances&kernels dists_kernels and distances modules: time series distances, kernels, pairwise transforms refactor Restructuring without changing its external behavior. Neither fixing a bug nor adding a feature.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants