
Bug: index.search returns invalid keys when k > index size #393

Closed
andersonbcdefg opened this issue Apr 10, 2024 · 16 comments
Labels
bug Something isn't working

Comments

@andersonbcdefg

Describe the bug

When creating a small index (size ~100) and searching it, I noticed it always returns some extremely large keys that are not actually present in the index. Here's an example of my own print debugging output:

Warning: 92 samples not in index_labels: [94264669291728, 119, 9223372036854775808, 94264674115280, 94264674115312, 208, 94278552092195, 139415592574144, 9223372036854775808, 753, 94264669262496, 94264673534624, 94264668185904, 94264676614832, 94264661017696, 94264673538080, 94264676662528, 94264682064784, 94264681824960, 107, 94264655611472, 94264655611504, 94264676636832, 113, 94264676436544, 186, 94264675185232, 94264675185264, 94264662133040, 189, 94264676662400, 94264676662432, 94264676992352, 94264673533232, 94264672356160, 257, 94264674115376, 208, 94278552062388, 139415592574144, 6305, 94264669268912, 139415592574144, 94264676412336, 94264676412304, 208, 94278552063349, 139415592574144, 5345, 94264669268912, 139415592574144, 2256, 94264669235504, 94264669279184, 94264674131696, 94264680759264, 94264669315648, 94264669315680, 94264669259168, 94264669259200, 94264676396400, 94264676396432, 94264676396464, 94264676396496, 94264674051952, 94264674051984, 114, 9223372036854775808, 94264676462544, 208, 94278552066357, 139415592574144, 11457, 94264669235920, 139415592574144, 94261647247312, 94264674130320, 208, 94278552067317, 139415592574144, 10497, 94264669235920, 139415592574144, 94261647247309, 94264669280224, 182, 9223372036854775808, 23013835270, 94278552069446, 94264669295984, 94264669271152, 94278552069510]

On further debugging, I realized this happens when I search for more "neighbors" than there are points in the index.

Steps to reproduce

Here is how I get the error: I create an Index, embed some documents (around 100), and then insert them. When I search for more than 100 neighbors, I get back invalid keys.
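
A minimal sketch of the kind of code that triggers it (illustrative values; my real code uses model embeddings):

import numpy as np
from usearch.index import Index

index = Index(ndim=256, metric="cos")
vectors = np.random.rand(100, 256).astype(np.float32)
index.add(np.arange(100), vectors)  # ~100 documents

# Ask for more neighbors than the index holds:
matches = index.search(vectors[0], count=200)
print(matches.keys)  # includes huge keys that were never inserted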

Expected behavior

index.search should only return keys that are present in the index. If searching for more items than there are in the index, I would expect to either get back only the items in the index (i.e. fewer than I asked for!), or to get some kind of error message. Silently returning invalid keys is not user-friendly! (in my opinion)
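
A caller-side workaround is to clamp the request before searching (sketch; assumes index.size reflects the number of stored vectors):

def safe_search(index, query, k):
    # Never ask for more neighbors than the index actually holds.
    return index.search(query, count=min(k, index.size))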

USearch version

2.11.3

Operating System

Debian slim-bookworm

Hardware architecture

x86

Which interface are you using?

Python bindings

Contact Details

andersonbcdefg@gmail.com

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct
@andersonbcdefg andersonbcdefg added the bug Something isn't working label Apr 10, 2024
@ashvardanian
Contributor

Great catch, @andersonbcdefg! On it ;)

@ashvardanian
Contributor

If you have a second to submit a PR with a minimal test case, that would help 🤗

ashvardanian added a commit that referenced this issue Apr 11, 2024
@ashvardanian
Contributor

@andersonbcdefg I have tried reproducing your issue and couldn't locate it in either the C++ or Python layer. Please check out 7db0c39 🤗

ashvardanian pushed a commit that referenced this issue Apr 11, 2024
## [2.11.4](v2.11.3...v2.11.4) (2024-04-11)

### Fix

* `#[repr(transparent)]` for `f16` and `b1x8` ([e182d77](e182d77))
* Non-native `f16` haversine ([c7922ff](c7922ff))

### Make

* Bump versions in `CMakeLists.txt` ([aadb717](aadb717))
* Revert WASI CI ([92e0b94](92e0b94))

### Test

* Oversubscribed search ([7db0c39](7db0c39)), closes [#393](#393)
* Re-seed NumPy PRNG ([8de87df](8de87df))
@andersonbcdefg
Author

Nice! Thanks for the quick fix here :)

@ashvardanian
Copy link
Contributor

Not really a fix, @andersonbcdefg, as I couldn’t reproduce, but please lmk if the issue persists 🤗

@fmmoret

fmmoret commented Jun 17, 2024

I'm actually encountering this same issue.

from usearch.index import Index
import numpy as np

# Note: nothing is ever added to the index, so every search runs
# against an empty index and should return zero matches.
make_index = lambda: Index(
    ndim=128,
    metric="ip",
    dtype="f32",
    connectivity=16,
    expansion_add=128,
    expansion_search=64,
)
k_index = make_index()

times = 100

results = [
    k_index.search(np.random.rand(1000, 128), count=50).keys != 0 for _ in range(times)
]

print(sum(sum([sum(r) for r in results])))
1367

This repro consistently produces 1367 random, extremely high values on my machine.

Strangely, the number goes down to 490 if I compare to 0 later:

from usearch.index import Index
import numpy as np

make_index = lambda: Index(
    ndim=128,
    metric="ip",
    dtype="f32",
    connectivity=16,
    expansion_add=128,
    expansion_search=64,
)
k_index = make_index()

times = 100

results = [
    k_index.search(np.random.rand(1000, 128), count=50).keys for _ in range(times)
]

nonzero_results = [r != 0 for r in results]

print(sum(sum([sum(r) for r in nonzero_results])))
print(np.unique(results))
490
[              0            8016          123105          123137
          123185          123217          123265          123297
          123345          123377          123425          123457
          123505          123537          123585          123617
          123665          123697          123745          123777
          123857          123889          123937          123969
          124017          124049          124097          124129
          124177          124209          124257          124289
          124337          124369          124417          124449
          124497          124529          124577          124609
          124657          124689          124737          124769
          124817          124849          124897          124929
          124977          125009          125057          125089
          125169          125249          125281          125329
          125361          125409          125441          125489
          125521          125569          125601          125649
          125681          125729          125761          125809
          125841          125889          125921          125969
          126001          126049          126081          126129
          126161          126209          126241          126289
          126321          126369          126401          126449
          126481          126561          126641          126673
          126721          126753          126801          126833
          126881          126913          126961          126993
          127041          127073          127121          127153
          200016  94306974031664  94306974072400  94306974369424
  94306974450880  94306974772016  94306975001600  94306975008544
  94306975048000  94306975270672  94306976378432  94306977388160
  94306977388320  94306977656784  94306977720144  94306977728704
  94306977773760  94306977813520  94306977891536 139821664062688]

Lastly, adding a sleep before I do the read:

from usearch.index import Index
import numpy as np

make_index = lambda: Index(
    ndim=128,
    metric="ip",
    dtype="f32",
    connectivity=16,
    expansion_add=128,
    expansion_search=64,
)
k_index = make_index()

times = 100

results = [
    k_index.search(np.random.rand(1000, 128), count=50).keys for _ in range(times)
]

import time

time.sleep(10)

nonzero_results = [r != 0 for r in results]

print(sum(sum([sum(r) for r in nonzero_results])))
print(np.unique(results))

results in all 0s / no crazy values.

@fmmoret

fmmoret commented Jun 17, 2024

Should probably reopen @ashvardanian

@ashvardanian
Contributor

@fmmoret please check the other members of the returned structure, not just keys. Let me know if the issue persists 🤗

@fmmoret

fmmoret commented Jun 18, 2024

Which ones are you interested in?

@fmmoret

fmmoret commented Jun 18, 2024

Iterating over the search results or calling .to_list() mitigates the error, but both are comparatively very slow.
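
For reference, the slow-but-correct paths look roughly like this (sketch; assuming the Matches/BatchMatches interfaces from the docs):

matches = k_index.search(np.random.rand(1000, 128), count=50)  # BatchMatches

# Subscripting yields one query's Matches; iterating that yields Match objects:
keys_for_first_query = [m.key for m in matches[0]]

# Or materialize everything at once:
all_matches = matches.to_list()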

@fmmoret

fmmoret commented Jun 18, 2024

In my application, I'm doing lots of searches against multiple indices (some searches against the same index), with locks around searches of the same index.
Under high load (1000-3000 values stored per index, 8 indices, 32 searches total (4 per index) with top-k = 10 and 1000 vectors per query), I eventually get back bad values. It takes a steady stream of load, but it eventually shows up.

matches_list: List[BatchMatches] = bulk_search(vectors)
print([len(m) for m in matches_list])
print([sum(m.keys > 10000) for m in matches_list]) # NOTE: my keys are not >10000
[1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000]
[array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([632, 632, 632, 632, 632, 631, 632, 631, 631, 631]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])]

EDIT:
this appears to happen exactly when usearch returns BatchMatches(0 across 1000 queries), i.e. empty results, so it's the same problem as in the repro.

@fmmoret

fmmoret commented Jun 18, 2024

Weird: if I replay the same searches that returned 0 matches, they come back with full matches. Very strange behavior.

@fmmoret

fmmoret commented Jun 18, 2024


I got around this issue with

import time

while k_index.size < expected_index_size:
    time.sleep(0.01)  # or raise after a timeout

The Python implementation doesn't offer a great way to check that indexing has finished. The progress callback reports progress at intervals but does not signal when indexing has actually completed.
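
A slightly fuller version of that polling workaround, with a hard timeout (sketch; expected_index_size is whatever count you submitted for insertion):

import time

def wait_for_indexing(index, expected_size, timeout_s=30.0, poll_s=0.05):
    # Poll index.size until all submitted vectors are visible, or give up.
    deadline = time.monotonic() + timeout_s
    while index.size < expected_size:
        if time.monotonic() > deadline:
            raise TimeoutError(f"indexing stalled at {index.size}/{expected_size}")
        time.sleep(poll_s)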

@ashvardanian
Contributor

ashvardanian commented Jun 18, 2024

If you check the object, it has just 3 array fields: keys, distances, and counts. You need to use the last one to navigate the results, or just the subscript operator, which I've also overloaded. You can find all the symbols in the official Python docs, including BatchMatches.
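
A minimal sketch of what that looks like in practice (assuming the fields described above; only the first counts[i] slots of row i are meaningful):

matches = index.search(queries, count=50)  # BatchMatches

valid_keys = [
    matches.keys[i, : matches.counts[i]]  # ignore the unfilled tail of each row
    for i in range(len(matches.counts))
]

# Equivalently, the overloaded subscript returns the i-th query's matches:
first_query = matches[0]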

@ashvardanian
Contributor

Under high load (1000-3000 values stored per index, 8 indices, 32 searches total (4 per index) with top-k = 10 and 1000 vectors per query), I eventually get back bad values. It takes a steady stream of load, but it eventually shows up.

This is a bad idea. Vector-search structures have logarithmic complexity for most operations, so you want to avoid sharding. The bigger the index, the more efficient search is compared to brute force.

Moreover, if you are dealing with only thousands of vectors, you don't need a vector search index at all and may consider using SimSIMD.
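
At that scale, brute force is just a matrix product plus a partial sort; a plain NumPy sketch (SimSIMD provides SIMD-accelerated kernels for the same distance computations):

import numpy as np

vectors = np.random.rand(2000, 128).astype(np.float32)   # the whole "index"
queries = np.random.rand(4, 128).astype(np.float32)

scores = queries @ vectors.T                  # inner-product similarity
top_k = np.argsort(-scores, axis=1)[:, :10]   # exact 10 best keys per query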

@fmmoret

fmmoret commented Jun 18, 2024

I understand your concern; the case I shared is a toy example at local dev scale. My full-scale app has multiple indices with millions of vectors. There are multiple indices because they use entirely different embedding spaces.
