Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DOC: 1D arrays with KDTree #19188

Closed
rjsdotorg opened this issue Sep 5, 2023 · 9 comments · Fixed by #19391
Closed

DOC: 1D arrays with KDTree #19188

rjsdotorg opened this issue Sep 5, 2023 · 9 comments · Fixed by #19391
Labels
Documentation Issues related to the SciPy documentation. Also check https://github.com/scipy/scipy.org good first issue Good topic for first contributor pull requests, with a relatively straightforward solution scipy.spatial
Milestone

Comments

@rjsdotorg
Copy link

Is your feature request related to a problem? Please describe.

We want to find close values between two 1D arrays of sampled plethysmographic data peaks, 25 sample points in this case.
KDTree does not allow 1D inputs

  File "C:\Users\Ray\Documents\GitHub\quality\lib\utils\file_ops.py", line 154, in reset_ppg
    tree = KDTree(peaks)
  File "C:\Python38\lib\site-packages\scipy\spatial\_kdtree.py", line 360, in __init__
    super().__init__(data, leafsize, compact_nodes, copy_data,
  File "_ckdtree.pyx", line 557, in scipy.spatial._ckdtree.cKDTree.__init__
ValueError: data must be 2 dimensions

_ckdtree.pyx line 557

        if data.ndim != 2:
            raise ValueError("data must be 2 dimensions")

so one has to add an empty dimension at the end:

        # peaks.shape == [100,] and peaks_filtered == [100,]
        tree = KDTree(peaks.reshape(-1, 1))
        other_tree = KDTree(peaks_filtered_tight.reshape(-1, 1))
        close_indexes = tree.query_ball_tree(other_tree, 25)

It took me a while to realize that I could do this as there is no mention in the docs.

Describe the solution you'd like.

Add a test for 1D inputs and promote to 2D, probably right where the ValueError is tested.

Describe alternatives you've considered.

Similar methods using searchsorted() or where() are much more difficult to construct the conditions, and using KDTree is simple and elegant.

Additional context (e.g. screenshots, GIFs)

peaks = [1006,  2808,  4625,  6448,  8252,  10076,  11883,  13697,  15525,  17327, 19151,  20957,  22766,  24593,  26384,  28195,  29993,  31778,  33588,  35369, 37156,  38959,  40732,  42529,  44311,  46094,  47890,  49666,  51462,  53260, 55048,  56851,  58649,  60440,  62257,  64049,  65849,  67668,  69455,  71256, 73059,  74832,  76631,  78415,  80191,  81981,  83753,  85531,  87321,  89075, 90850,  92620,  94374,  96156,  97914,  99677, 101457, 103218, 104983, 106759, 108507, 110279, 112053, 113799, 115570, 117352, 119107, 120877, 122661, 124417, 126183, 127964, 129726, 131488, 133270, 135034, 136792, 138583, 140343, 142114, 143904, 145673, 147444, 149234, 151010, 152781, 154576, 156341, 158126, 159923, 161692, 163486, 165284, 167052, 168848, 170647, 172422, 174219, 176004, 177783, 179579, 181371, 183151, 184954, 186733, 188518, 192078, 193934, 195709, 197494, 199279, 201041, 202816, 204606, 206372, 208174, 209950, 211749, 213564, 215357, 217161, 218965, 220751, 222569, 224355, 226148, 227962, 229746, 231541, 235078, 236933, 238689, 240461, 242231, 243980, 245757, 247529, 249295, 251088, 252861, 254651, 256458, 258248, 260053, 261881, 263679, 265508, 267324, 269128, 270964, 272779, 274337, 276389, 278231, 280042, 281871, 283650, 285465, 287262, 289068, 290886, 292687, 294496, 296330, 298130, 299956, 301788, 303600, 305436, 307267, 309087, 310927, 312728, 314550, 316363, 318159, 319984, 321775, 323575, 325384, 327162, 328968, 330761, 332536, 334341, 336130, 337915, 339728, 341524, 343321, 345143, 346941, 348755, 350585, 352382, 354204, 356026, 357837, 359662, 361471, 363284, 365111, 368679, 370525, 372292, 374089, 375874, 377641, 379433, 381211, 384753, 386585, 388349, 390145, 391909, 393696, 395500, 397287, 399063, 400873, 402649, 404426, 406224, 408002, 409788, 411598, 413372, 415182, 416986, 418787, 420610, 422415, 424232, 426068, 427878, 429699, 431532, 433341, 435167, 437001, 438811, 440655, 442476, 444300, 446137, 447942, 449773, 451584, 453396, 455210, 456992, 458803, 460595, 462381, 464188, 465976]



peaks_filtered_tight = [1001,  2804,  4621,  6444,  8246,  10072,  11877,  13692,  15519,  17322, 19146,  20952,  22760,  24588,  26379,  28189,  29989,  31774,  33581,  35364, 37151,  38954,  40728,  42524,  44306,  46087,  47886,  49662,  51457,  53254, 55042,  56846,  58643,  60436,  62253,  64045,  65844,  67663,  69449,  71251, 73054,  74830,  76626,  78409,  80187,  81978,  83747,  85526,  87316,  89071, 90846,  92615,  94369,  96151,  97910,  99673, 101452, 103211, 104978, 106754, 108502, 110271, 112048, 113794, 115565, 117346, 119103, 120872, 122655, 124413, 126179, 127960, 129721, 131484, 133266, 135029, 136789, 138578, 140339, 142109, 143900, 145668, 147440, 149230, 151005, 152778, 154572, 156337, 158122, 159917, 161687, 163480, 165278, 167048, 168844, 170641, 172417, 174215, 175999, 177779, 179577, 181364, 183143, 184952, 186733, 188510, 192073, 193935, 195706, 197487, 199274, 201039, 202813, 204599, 206366, 208168, 209947, 211745, 213560, 215354, 217156, 218959, 220748, 222565, 224349, 226143, 227961, 229745, 231533, 235075, 236934, 238684, 240451, 242226, 243979, 245753, 247523, 249290, 251084, 252856, 254646, 256456, 258241, 260048, 261878, 263678, 265502, 267319, 269127, 270962, 272771, 274328, 276384, 278231, 280036, 281862, 283646, 285463, 287258, 289063, 290881, 292682, 294492, 296325, 298126, 299950, 301784, 303597, 305431, 307264, 309082, 310923, 312723, 314545, 316358, 318156, 319979, 321771, 323571, 325378, 327157, 328961, 330756, 332533, 334336, 336125, 337911, 339723, 341519, 343317, 345138, 346937, 348750, 350582, 352378, 354199, 356022, 357832, 359656, 361469, 363282, 365105, 368673, 370527, 372287, 374079, 375868, 377643, 379431, 381204, 384749, 386584, 388343, 390139, 391905, 393693, 395495, 397280, 399059, 400868, 402644, 404421, 406219, 407996, 409783, 411594, 413367, 415176, 416982, 418782, 420606, 422409, 424228, 426063, 427872, 429694, 431527, 433336, 435163, 436995, 438809, 440649, 442472, 444294, 446132, 447940, 449768, 451580, 453390, 455204, 456990, 458800, 460590, 462374, 464184, 465969]
@rjsdotorg rjsdotorg added the enhancement A new feature or improvement label Sep 5, 2023
@tylerjereddy
Copy link
Contributor

The docs do indicate that the data input is of shape (n,m) and:

The n data points of dimension m to be indexed.

So I'd probably read that as shape (n, 1) for 1-D data, and also always 2-D arrays for any n-D dimensions really. There is occasional confusion about the fact that 2-D arrays can effectively support arbitrary dimensions, but it isn't isolated to KDTree.

I don't know that we'd want to reshape 1-D input by design/automatically--this could also silently allow through a mistake when a user accidentally meant to pass in high dimensional data but only provided a single record/point as a 1-D array of columns. cc @sturlamolden perhaps

@rjsdotorg
Copy link
Author

What I'd meant perhaps was that the docs do not mention that 1D data is acceptable if trivially reshaped to 2D: a.reshape(-1, 1)
I understand the conservative perspective regarding automatic reshape, but that would also be a case where users are relying on the __init__ to do their data checking for them.

IMO that a.reshape(-1, 1) be mentioned in examples (of which KDTree has none and the Cookbook is quite complex).

Examples

Find the nearest neighbors of a pair of 1D data arrays or lists

>>> from scipy.spatial import KDTree
>>> import numpy as np
>>> peaks = np.array([1036,  2808,  4625,  6448,  8292,  10076,  11883])
>>> peaks_2 = np.array([1001,  2804,  4621,  6444,  8246,  10072,  11877])
>>> tree = KDTree(peaks.reshape(-1, 1))
>>> other_tree = KDTree(peaks_2.reshape(-1, 1))
>>> close_indexes = tree.query_ball_tree(other_tree, 25)
>>> close_indexes
[[], [1], [2], [3], [], [5], [6]]
>>> np.hstack(close_indexes).astype(np.int)
array([1, 2, 3, 5, 6])
>>> np.hstack(close_indexes, dtype=np.int)
array([1, 2, 3, 5, 6])

@sturlamolden
Copy link
Contributor

That data array must be 2-dimensional does not mean that the kd-tree only works for 2-dimensional data. The data array has to be of shape (n, d) where n is the number of data and d is the numer of dimensions. Presumably 1D data would the require a data array of shape (n, 1). That can be taken care of with a simple reshape. The kd-tree cannot accept 1D data arrays because there would be no way of telling the difference between (n, 1) and (1, n).

That said, I do not know why anyone would want to use a kd-tree on 1D data. It is the kind of thing a binary search tree (BST) does better.

@rjsdotorg
Copy link
Author

The kd-tree cannot accept 1D data arrays because there would be no way of telling the difference between (n, 1) and (1, n)

for n = 1, true

That said, I do not know why anyone would want to use a kd-tree on 1D data. It is the kind of thing a binary search tree (BST) does better.

Well, "better" in this case means a correct result with 4 lines of code - there is no BST function as such in scipy/numpy, and 3rd party like Binary_Search_Trees isn't as succinct and also adds to the project S.O.U.P. tracking required by FDA.

@sturlamolden
Copy link
Contributor

I have no idea what SOUP tracking is 🤔

@rkern
Copy link
Member

rkern commented Sep 5, 2023

"Software Of Unknown Provenance"

@tylerjereddy tylerjereddy added Documentation Issues related to the SciPy documentation. Also check https://github.com/scipy/scipy.org and removed enhancement A new feature or improvement labels Sep 5, 2023
@tylerjereddy
Copy link
Contributor

I'll change to a documentation label then, I think Sturla and I both said pretty similar things.

@tylerjereddy tylerjereddy changed the title ENH: add support for 1D arrays to KDTree DOC: 1D arrays with KDTree Sep 5, 2023
@sturlamolden
Copy link
Contributor

sturlamolden commented Sep 6, 2023

The kd-tree cannot accept 1D data arrays because there would be no way of telling the difference between (n, 1) and (1, n)

for n = 1, true

No, what I mean is we cannot guess if (n,) should be (n,1) or (1,n), for any n, or even (n//d, d). It would have different meaning in this context. (n,1) means n points in an 1D space and (1,n) is a single point in an n-D space. A reshape on behalf of the user makes the intent clear.

I think perhaps that the error message is confusing as it could be read as if the the data needs to be (n, 2), i.e. that only data points in a 2D space are allowed. Maybe we should improve on it and say it should be shape (n, d) instead.

@rjsdotorg
Copy link
Author

ValueError: data must be 2 dimensions

could be perhaps (from my user perspective)

ValueError: data must have shape (n, d) where n>=1, d>=1

@j-bowhay j-bowhay added the good first issue Good topic for first contributor pull requests, with a relatively straightforward solution label Sep 6, 2023
E-W-Jones added a commit to E-W-Jones/scipy that referenced this issue Oct 15, 2023
Based on the discussion in scipy#19188, change the error message to make it
clearer that we're dealing with a 2D numpy array, but the data in the
array can be any dimension (1D, 2D, ...).
j-bowhay added a commit that referenced this issue Oct 16, 2023
#19391)

* DOC: Change error message in spatial.KDTree to be more informative.

Based on the discussion in #19188, change the error message to make it
clearer that we're dealing with a 2D numpy array, but the data in the
array can be any dimension (1D, 2D, ...).

* Update scipy/spatial/_ckdtree.pyx

Co-authored-by: Jake Bowhay <60778417+j-bowhay@users.noreply.github.com>

---------

Co-authored-by: Jake Bowhay <60778417+j-bowhay@users.noreply.github.com>
@j-bowhay j-bowhay added this to the 1.12.0 milestone Oct 16, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Documentation Issues related to the SciPy documentation. Also check https://github.com/scipy/scipy.org good first issue Good topic for first contributor pull requests, with a relatively straightforward solution scipy.spatial
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants