Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BM25 Index Corruption / BM25 not crash-resistant #4125

Open
1 task done
etiennedi opened this issue Feb 2, 2024 · 11 comments
Open
1 task done

BM25 Index Corruption / BM25 not crash-resistant #4125

etiennedi opened this issue Feb 2, 2024 · 11 comments
Labels

Comments

@etiennedi
Copy link
Member

How to reproduce this bug?

Not known yet - working on reproduction. Few independent sightings

What is the expected behavior?

BM25 index is crash-resistant

What is the actual behavior?

Users occasionally see panics like the following:

panic: runtime error: index out of range [16] with length 16

goroutine 50113 [running]:
github.com/weaviate/weaviate/adapters/repos/db/inverted.(*BM25Searcher).createTerm(0xc09dd078c0, 0x413b3f1f00000000, {0x0, 0x0}, {0xc002f08edd, 0xe}, {0xc002a2ef00, 0xe, 0x10?}, 0xc03bb94fc0?, ...)
	/go/src/github.com/weaviate/weaviate/adapters/repos/db/inverted/bm25_searcher.go:431 +0x126e
github.com/weaviate/weaviate/adapters/repos/db/inverted.(*BM25Searcher).wand.func1()

Supporting information

We are aware and investigating. If you have a reliable way to reproduce this (from scratch), please do let us know.

Server Version

Spotted on 1.22.x and 1.23.x

Code of Conduct

@etiennedi etiennedi added the bug label Feb 2, 2024
@etiennedi
Copy link
Member Author

etiennedi commented Feb 2, 2024

Known Mitigation techniques

Multi-Tenancy

If MT, the bug typically only affects a single or very few tenants. Here, the easiest would be to reinsert the entire tenant (using the Cursor API): Export all objects for the tenant. Delete the tenant, recreate the tenant, reinsert the data.

Single-Tenancy

Essentially, the same mitigation as for MT (whole collection instead of tenant), but it may not always be feasible due to the size. If possible, restore a backup from a point before the issue occurred.

Known workarounds

Pure Vector Search instead of Hybrid

The issue is isolated to BM25 indexing. If you are using Hybrid Search and running into this issue, one option could be to temporarily use pure vector search.

@etiennedi
Copy link
Member Author

Possibly related: #3548

@etiennedi
Copy link
Member Author

We are optimistic that we can provide a fix soon. Some updates:

  • We have successfully reproduced this starting with a clean instance for the first this morning
  • We have a first candidate for a patch, if we run our reproducing scripts with the patched version, no corruption occurs anymore
  • We also fairly confident that we can apply a patch that will automatically mitigate this for already corrupted clusters, it would then simply skip corrupted elements, but still allow a search on valid elements.
    • This is only meant as an emergency fix, since the corruption itself is fairly unpredictable, a better way would be to reingest corrupt tenants and collections once the patch is live.

@etiennedi
Copy link
Member Author

etiennedi commented Feb 3, 2024

Checklist for Rollout

Before Release

  • fix root cause so no new corruptions can appear (in progress by @jeroiraz)
    • make sure fix does don't discard entire WALs where some writes have already been ack'ed to the user (@jeroiraz has ideas how to do this)
    • validate again that fix does not lead to new corruptions when crashes occur during batch delete. Update Feb 4: This can now be done automatically on CI using the new chaos pipeline (see below).
  • fix handling of already corrupted datasets (@aliszka and @amourao have context here. The assumption is that an incorrect pre-allocation is the culprit for the out-of-range panic)
    • validate using the existing reproduction script or using data dump of an already corrupt setup
  • in Handle internal errors in BM25/WAND logic more gracefully #4100 we added panic-handling to BM25. However, as part of the investigation we have found out that this does not cover all code paths. This only covers the "BM25" (single prop) path, but if you take the "BM25F" path panics are not handled yet. Todo: also handle panics on other code path.

Release

  • release as v1.23.x patch
  • release as v1.22.x patch
    • if porting to both releases is a lot of effort, prefer v1.23.x and release v1.22.x later, so clusters already running the latest version can be patched first
  • align internally (WCS/SRE) that all commercial clusters get patched immediately
  • align with the DevRel team to tell the community to patch their cluster immediately

Can happen after initial release

  • turn the current reproduction scripts for the BM25 corruption into an automated chaos pipeline
    • the difficulty here is that the crash needs to happen at the right moment. One idea could be to make the import script also be the killer script. Then when the start delete cycle phase happens, concurrently sleep for a short random time period, then kill the script.
    • Update Feb 4: The pipeline is now available here: Add BM25 corruption pipeline: Crashing during batch deletes weaviate-chaos-engineering#172
    • TODO: merge pipeline into main. Can only be done once a fix is available, so it will be green.
  • investigate sighting of WARN[0007] Skipping object in BM25: object with id 36557 has a length of 0 bytes. action=bm25_search class=Book shard=qpxiJ3I5BbV6.
    • The current theory is that this is just because there is no transactional guarantee between the object store and the inverted index. This could simply mean that the object was already deleted in the object store, but not yet in the inverted index. If correct, then this is not a big issue and could be rectified through a lazy repair
    • (maybe) lazily repair stale inverted index entry if obj is already deleted
  • investigate if other issues that were spotted as part of this investigation (e.g. Crahsloop with panic: assignment to entry in nil map #4128) are also fixed by the current changes. If not, plan for fixing those

@etiennedi
Copy link
Member Author

Update: A chaos pipeline that reproduces the issue fully automatically (starting with an empty instance) is now available: weaviate/weaviate-chaos-engineering#172

@tickx-cegeka
Copy link

tickx-cegeka commented Feb 7, 2024

Hello @etiennedi
This is an important bug for us that holds us back from moving to the v4 client.
Do you have any new estimate on when this fix would be released? Thank you.

@tickx-cegeka
Copy link

@amourao Could you tell me if this bug was fully fixed with your PR?

@etiennedi
Copy link
Member Author

This is an important bug for us that holds us back from moving to the v4 client.
Do you have any new estimate on when this fix would be released? Thank you.

This should not have any relation to the v4 client.

@amourao Could you tell me if this bug was fully fixed with your PR?

It can mitigate it in many scenarios. I wouldn't call it "fully fixed" because more crash-safety improvements are coming. But it can't hurt to try if it fixes all issues for you already.

@tickx-cegeka
Copy link

tickx-cegeka commented Feb 12, 2024

@etiennedi Thank you, I will test it out soon.
We only saw this happening as soon as we refactored our code to the v4 client though.

Edit: did not see it happen again, thanks.

@chriscasola
Copy link

Is there any update on this issue? This comment made is sound like a fix was in sight but there haven't been any recent updates.

@rthiiyer82
Copy link

#4262 -> addresses (maybe) lazily repair stale inverted index entry if obj is already deleted

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants