
Bug: index.search returns invalid keys when k > index size #393

Closed
andersonbcdefg opened this issue Apr 10, 2024 · 16 comments
Labels
bug Something isn't working

Comments

@andersonbcdefg

Describe the bug

When creating a small index (size ~100) and searching it, I noticed it always returns some extremely large keys that are not actually present in the index. Here's an example of my own print debugging output:

Warning: 92 samples not in index_labels: [94264669291728, 119, 9223372036854775808, 94264674115280, 94264674115312, 208, 94278552092195, 139415592574144, 9223372036854775808, 753, 94264669262496, 94264673534624, 94264668185904, 94264676614832, 94264661017696, 94264673538080, 94264676662528, 94264682064784, 94264681824960, 107, 94264655611472, 94264655611504, 94264676636832, 113, 94264676436544, 186, 94264675185232, 94264675185264, 94264662133040, 189, 94264676662400, 94264676662432, 94264676992352, 94264673533232, 94264672356160, 257, 94264674115376, 208, 94278552062388, 139415592574144, 6305, 94264669268912, 139415592574144, 94264676412336, 94264676412304, 208, 94278552063349, 139415592574144, 5345, 94264669268912, 139415592574144, 2256, 94264669235504, 94264669279184, 94264674131696, 94264680759264, 94264669315648, 94264669315680, 94264669259168, 94264669259200, 94264676396400, 94264676396432, 94264676396464, 94264676396496, 94264674051952, 94264674051984, 114, 9223372036854775808, 94264676462544, 208, 94278552066357, 139415592574144, 11457, 94264669235920, 139415592574144, 94261647247312, 94264674130320, 208, 94278552067317, 139415592574144, 10497, 94264669235920, 139415592574144, 94261647247309, 94264669280224, 182, 9223372036854775808, 23013835270, 94278552069446, 94264669295984, 94264669271152, 94278552069510]

On further debugging, I realized this happens when I search for more "neighbors" than there are points in the index.

Steps to reproduce

Here is how I get the error: I create an Index, embed some documents (around 100), and then insert them. When I search for more than 100 neighbors, I get back invalid keys.
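
A minimal sketch of the kind of code that triggers it (illustrative values; my real code uses model embeddings):

import numpy as np
from usearch.index import Index

index = Index(ndim=256, metric="cos")
vectors = np.random.rand(100, 256).astype(np.float32)
index.add(np.arange(100), vectors)  # ~100 documents

# Ask for more neighbors than the index holds:
matches = index.search(vectors[0], count=200)
print(matches.keys)  # includes huge keys that were never inserted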

Expected behavior

index.search should only return keys that are present in the index. If searching for more items than there are in the index, I would expect to either get back only the items in the index (i.e. fewer than I asked for!), or to get some kind of error message. Silently returning invalid keys is not user-friendly! (in my opinion)
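
A caller-side workaround is to clamp the request before searching (sketch; assumes index.size reflects the number of stored vectors):

def safe_search(index, query, k):
    # Never ask for more neighbors than the index actually holds.
    return index.search(query, count=min(k, index.size))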

USearch version

2.11.3

Operating System

Debian slim-bookworm

Hardware architecture

x86

Which interface are you using?

Python bindings

Contact Details

andersonbcdefg@gmail.com

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct
@andersonbcdefg andersonbcdefg added the bug Something isn't working label Apr 10, 2024
@ashvardanian
Contributor

Great catch, @andersonbcdefg! On it ;)

@ashvardanian
Contributor

If you have a second to submit a PR with a minimal test case, that would help 🤗

ashvardanian added a commit that referenced this issue Apr 11, 2024
@ashvardanian
Contributor

@andersonbcdefg I have tried reproducing your issue and couldn't locate it in either the C++ or Python layer. Please check out 7db0c39 🤗

ashvardanian pushed a commit that referenced this issue Apr 11, 2024
## [2.11.4](v2.11.3...v2.11.4) (2024-04-11)

### Fix

* `#[repr(transparent)]` for `f16` and `b1x8` ([e182d77](e182d77))
* Non-native `f16` haversine ([c7922ff](c7922ff))

### Make

* Bump versions in `CMakeLists.txt` ([aadb717](aadb717))
* Revert WASI CI ([92e0b94](92e0b94))

### Test

* Oversubscribed search ([7db0c39](7db0c39)), closes [#393](#393)
* Re-seed NumPy PRNG ([8de87df](8de87df))
@andersonbcdefg
Author

Nice! Thanks for the quick fix here :)

@ashvardanian
Copy link
Contributor

Not really a fix, @andersonbcdefg, as I couldn’t reproduce, but please lmk if the issue persists 🤗

@fmmoret

fmmoret commented Jun 17, 2024

I'm actually encountering this same issue.

from usearch.index import Index
import numpy as np

# Note: nothing is ever added to the index, so every search runs
# against an empty index and should return zero matches.
make_index = lambda: Index(
    ndim=128,
    metric="ip",
    dtype="f32",
    connectivity=16,
    expansion_add=128,
    expansion_search=64,
)
k_index = make_index()

times = 100

results = [
    k_index.search(np.random.rand(1000, 128), count=50).keys != 0 for _ in range(times)
]

print(sum(sum([sum(r) for r in results])))
1367

This repro consistently produces 1367 random, extremely high values on my machine.

Strangely, the number goes down to 490 if I compare to 0 later:

from usearch.index import Index
import numpy as np

make_index = lambda: Index(
    ndim=128,
    metric="ip",
    dtype="f32",
    connectivity=16,
    expansion_add=128,
    expansion_search=64,
)
k_index = make_index()

times = 100

results = [
    k_index.search(np.random.rand(1000, 128), count=50).keys for _ in range(times)
]

nonzero_results = [r != 0 for r in results]

print(sum(sum([sum(r) for r in nonzero_results])))
print(np.unique(results))
490
[              0            8016          123105          123137
          123185          123217          123265          123297
          123345          123377          123425          123457
          123505          123537          123585          123617
          123665          123697          123745          123777
          123857          123889          123937          123969
          124017          124049          124097          124129
          124177          124209          124257          124289
          124337          124369          124417          124449
          124497          124529          124577          124609
          124657          124689          124737          124769
          124817          124849          124897          124929
          124977          125009          125057          125089
          125169          125249          125281          125329
          125361          125409          125441          125489
          125521          125569          125601          125649
          125681          125729          125761          125809
          125841          125889          125921          125969
          126001          126049          126081          126129
          126161          126209          126241          126289
          126321          126369          126401          126449
          126481          126561          126641          126673
          126721          126753          126801          126833
          126881          126913          126961          126993
          127041          127073          127121          127153
          200016  94306974031664  94306974072400  94306974369424
  94306974450880  94306974772016  94306975001600  94306975008544
  94306975048000  94306975270672  94306976378432  94306977388160
  94306977388320  94306977656784  94306977720144  94306977728704
  94306977773760  94306977813520  94306977891536 139821664062688]

Lastly, adding a sleep before I do the read:

from usearch.index import Index
import numpy as np

make_index = lambda: Index(
    ndim=128,
    metric="ip",
    dtype="f32",
    connectivity=16,
    expansion_add=128,
    expansion_search=64,
)
k_index = make_index()

times = 100

results = [
    k_index.search(np.random.rand(1000, 128), count=50).keys for _ in range(times)
]

import time

time.sleep(10)

nonzero_results = [r != 0 for r in results]

print(sum(sum([sum(r) for r in nonzero_results])))
print(np.unique(results))

results in all 0s / no crazy values.

@fmmoret

fmmoret commented Jun 17, 2024

Should probably reopen @ashvardanian

@ashvardanian
Contributor

@fmmoret please check the other members of the returned structure, not just keys. Let me know if the issue persists 🤗

@fmmoret

fmmoret commented Jun 18, 2024

Which ones are you interested in?

@fmmoret

fmmoret commented Jun 18, 2024

Iterating over the search results or calling .to_list() mitigates the error, but both are comparatively very slow.
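
For reference, the slow-but-correct paths look roughly like this (sketch; assuming the Matches/BatchMatches interfaces from the docs):

matches = k_index.search(np.random.rand(1000, 128), count=50)  # BatchMatches

# Subscripting yields one query's Matches; iterating that yields Match objects:
keys_for_first_query = [m.key for m in matches[0]]

# Or materialize everything at once:
all_matches = matches.to_list()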

@fmmoret

fmmoret commented Jun 18, 2024

In my application, I'm doing lots of searches against multiple indices (some searches against the same index), with locks around searches of the same index.
Under high load (1000-3000 values stored per index, 8 indices, 32 searches total (4 per index) with top-k = 10 and 1000 vectors per query), I eventually get back bad values. It takes a steady stream of load, but it eventually shows up.

matches_list: List[BatchMatches] = bulk_search(vectors)
print([len(m) for m in matches_list])
print([sum(m.keys > 10000) for m in matches_list]) # NOTE: my keys are not >10000
[1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000]
[array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([632, 632, 632, 632, 632, 631, 632, 631, 631, 631]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])]

EDIT:
this appears to happen exactly when usearch returns BatchMatches(0 across 1000 queries), i.e. empty results, so it's the same problem as in the repro.

@fmmoret

fmmoret commented Jun 18, 2024

Weird: if I replay the same searches that returned 0 matches, they come back with full matches. Very strange behavior.

@fmmoret

fmmoret commented Jun 18, 2024


I got around this issue with

import time

while k_index.size < expected_index_size:
    time.sleep(0.01)  # or raise after a timeout

The Python implementation doesn't offer a great way to check that indexing has finished. The progress callback reports progress at intervals but does not signal when indexing has actually completed.
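
A slightly fuller version of that polling workaround, with a hard timeout (sketch; expected_index_size is whatever count you submitted for insertion):

import time

def wait_for_indexing(index, expected_size, timeout_s=30.0, poll_s=0.05):
    # Poll index.size until all submitted vectors are visible, or give up.
    deadline = time.monotonic() + timeout_s
    while index.size < expected_size:
        if time.monotonic() > deadline:
            raise TimeoutError(f"indexing stalled at {index.size}/{expected_size}")
        time.sleep(poll_s)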

@ashvardanian
Contributor

ashvardanian commented Jun 18, 2024

If you check the object, it has just 3 array fields: keys, distances, and counts. You need to use the last one to navigate the results, or just the subscript operator, which I've also overloaded. You can find all the symbols in the official Python docs, including BatchMatches.
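
A minimal sketch of what that looks like in practice (assuming the fields described above; only the first counts[i] slots of row i are meaningful):

matches = index.search(queries, count=50)  # BatchMatches

valid_keys = [
    matches.keys[i, : matches.counts[i]]  # ignore the unfilled tail of each row
    for i in range(len(matches.counts))
]

# Equivalently, the overloaded subscript returns the i-th query's matches:
first_query = matches[0]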

@ashvardanian
Contributor

Under high load (1000-3000 values stored per index, 8 indices, 32 searches total (4 per index) with top-k = 10 and 1000 vectors per query), I eventually get back bad values. It takes a steady stream of load, but it eventually shows up.

This is a bad idea. Vector-search structures have logarithmic complexity for most operations, so you want to avoid sharding. The bigger the index, the more efficient search is compared to brute force.

Moreover, if you are dealing with only thousands of vectors, you don't need a vector search index at all and may consider using SimSIMD.
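
At that scale, brute force is just a matrix product plus a partial sort; a plain NumPy sketch (SimSIMD provides SIMD-accelerated kernels for the same distance computations):

import numpy as np

vectors = np.random.rand(2000, 128).astype(np.float32)   # the whole "index"
queries = np.random.rand(4, 128).astype(np.float32)

scores = queries @ vectors.T                  # inner-product similarity
top_k = np.argsort(-scores, axis=1)[:, :10]   # exact 10 best keys per query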

@fmmoret

fmmoret commented Jun 18, 2024

I understand your concern; the case I shared is a toy example at local dev scale. My full-scale app has multiple indices with millions of vectors. There are multiple indices because they use entirely different embedding spaces.
