BM25 Index Corruption / BM25 not crash-resistant #4125

etiennedi · 2024-02-02T13:05:00Z

How to reproduce this bug?

Not known yet - working on reproduction. Few independent sightings

What is the expected behavior?

BM25 index is crash-resistant

What is the actual behavior?

Users occasionally see panics like the following:

panic: runtime error: index out of range [16] with length 16

goroutine 50113 [running]:
github.com/weaviate/weaviate/adapters/repos/db/inverted.(*BM25Searcher).createTerm(0xc09dd078c0, 0x413b3f1f00000000, {0x0, 0x0}, {0xc002f08edd, 0xe}, {0xc002a2ef00, 0xe, 0x10?}, 0xc03bb94fc0?, ...)
	/go/src/github.com/weaviate/weaviate/adapters/repos/db/inverted/bm25_searcher.go:431 +0x126e
github.com/weaviate/weaviate/adapters/repos/db/inverted.(*BM25Searcher).wand.func1()

Supporting information

We are aware and investigating. If you have a reliable way to reproduce this (from scratch), please do let us know.

Server Version

Spotted on 1.22.x and 1.23.x

Code of Conduct

I have read and agree to the Weaviate's Contributor Guide and Code of Conduct

The text was updated successfully, but these errors were encountered:

etiennedi · 2024-02-02T13:09:40Z

Known Mitigation techniques

Multi-Tenancy

If MT, the bug typically only affects a single or very few tenants. Here, the easiest would be to reinsert the entire tenant (using the Cursor API): Export all objects for the tenant. Delete the tenant, recreate the tenant, reinsert the data.

Single-Tenancy

Essentially, the same mitigation as for MT (whole collection instead of tenant), but it may not always be feasible due to the size. If possible, restore a backup from a point before the issue occurred.

Known workarounds

Pure Vector Search instead of Hybrid

The issue is isolated to BM25 indexing. If you are using Hybrid Search and running into this issue, one option could be to temporarily use pure vector search.

etiennedi · 2024-02-03T08:49:50Z

Possibly related: #3548

etiennedi · 2024-02-03T10:20:27Z

We are optimistic that we can provide a fix soon. Some updates:

We have successfully reproduced this starting with a clean instance for the first this morning
We have a first candidate for a patch, if we run our reproducing scripts with the patched version, no corruption occurs anymore
We also fairly confident that we can apply a patch that will automatically mitigate this for already corrupted clusters, it would then simply skip corrupted elements, but still allow a search on valid elements.
- This is only meant as an emergency fix, since the corruption itself is fairly unpredictable, a better way would be to reingest corrupt tenants and collections once the patch is live.

etiennedi · 2024-02-03T19:09:16Z

Checklist for Rollout

Before Release

fix root cause so no new corruptions can appear (in progress by @jeroiraz)
- make sure fix does don't discard entire WALs where some writes have already been ack'ed to the user (@jeroiraz has ideas how to do this)
- validate again that fix does not lead to new corruptions when crashes occur during batch delete. Update Feb 4: This can now be done automatically on CI using the new chaos pipeline (see below).
fix handling of already corrupted datasets (@aliszka and @amourao have context here. The assumption is that an incorrect pre-allocation is the culprit for the out-of-range panic)
- validate using the existing reproduction script or using data dump of an already corrupt setup
in Handle internal errors in BM25/WAND logic more gracefully #4100 we added panic-handling to BM25. However, as part of the investigation we have found out that this does not cover all code paths. This only covers the "BM25" (single prop) path, but if you take the "BM25F" path panics are not handled yet. Todo: also handle panics on other code path.

Release

release as v1.23.x patch
release as v1.22.x patch
- if porting to both releases is a lot of effort, prefer v1.23.x and release v1.22.x later, so clusters already running the latest version can be patched first
align internally (WCS/SRE) that all commercial clusters get patched immediately
align with the DevRel team to tell the community to patch their cluster immediately

Can happen after initial release

turn the current reproduction scripts for the BM25 corruption into an automated chaos pipeline
- the difficulty here is that the crash needs to happen at the right moment. One idea could be to make the import script also be the killer script. Then when the start delete cycle phase happens, concurrently sleep for a short random time period, then kill the script.
- Update Feb 4: The pipeline is now available here: Add BM25 corruption pipeline: Crashing during batch deletes weaviate-chaos-engineering#172
- TODO: merge pipeline into main. Can only be done once a fix is available, so it will be green.
investigate sighting of WARN[0007] Skipping object in BM25: object with id 36557 has a length of 0 bytes. action=bm25_search class=Book shard=qpxiJ3I5BbV6.
- The current theory is that this is just because there is no transactional guarantee between the object store and the inverted index. This could simply mean that the object was already deleted in the object store, but not yet in the inverted index. If correct, then this is not a big issue and could be rectified through a lazy repair
- (maybe) lazily repair stale inverted index entry if obj is already deleted
investigate if other issues that were spotted as part of this investigation (e.g. Crahsloop with panic: assignment to entry in nil map #4128) are also fixed by the current changes. If not, plan for fixing those

etiennedi · 2024-02-04T13:07:01Z

Update: A chaos pipeline that reproduces the issue fully automatically (starting with an empty instance) is now available: weaviate/weaviate-chaos-engineering#172

tickx-cegeka · 2024-02-07T15:36:56Z

Hello @etiennedi
This is an important bug for us that holds us back from moving to the v4 client.
Do you have any new estimate on when this fix would be released? Thank you.

tickx-cegeka · 2024-02-12T08:40:25Z

@amourao Could you tell me if this bug was fully fixed with your PR?

etiennedi · 2024-02-12T13:48:27Z

This is an important bug for us that holds us back from moving to the v4 client.
Do you have any new estimate on when this fix would be released? Thank you.

This should not have any relation to the v4 client.

@amourao Could you tell me if this bug was fully fixed with your PR?

It can mitigate it in many scenarios. I wouldn't call it "fully fixed" because more crash-safety improvements are coming. But it can't hurt to try if it fixes all issues for you already.

tickx-cegeka · 2024-02-12T13:49:56Z

@etiennedi Thank you, I will test it out soon.
We only saw this happening as soon as we refactored our code to the v4 client though.

Edit: did not see it happen again, thanks.

chriscasola · 2024-04-23T12:32:26Z

Is there any update on this issue? This comment made is sound like a fix was in sight but there haven't been any recent updates.

rthiiyer82 · 2024-05-23T11:42:42Z

#4262 -> addresses (maybe) lazily repair stale inverted index entry if obj is already deleted

etiennedi added the bug label Feb 2, 2024

etiennedi mentioned this issue Feb 2, 2024

Crahsloop with panic: assignment to entry in nil map #4128

Open

1 task

etiennedi mentioned this issue Feb 4, 2024

Interrupted batch delete results in index corruption #3548

Open

amourao mentioned this issue Feb 8, 2024

Write Tombstones to LSM properly and handle badly formed records #4167

Merged

4 tasks

This was referenced Feb 14, 2024

Async Index - Bug Failure #4109

Closed

Startup failure after crash #4038

Open

amourao added a commit that referenced this issue Feb 16, 2024

Reduce logging message frequency for non-recovered objects #4125

a861145

amourao mentioned this issue Feb 16, 2024

Reduce logging message frequency for non-recovered objects #4242

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BM25 Index Corruption / BM25 not crash-resistant #4125

BM25 Index Corruption / BM25 not crash-resistant #4125

etiennedi commented Feb 2, 2024

etiennedi commented Feb 2, 2024 •

edited

etiennedi commented Feb 3, 2024

etiennedi commented Feb 3, 2024

etiennedi commented Feb 3, 2024 •

edited by rthiiyer82

etiennedi commented Feb 4, 2024

tickx-cegeka commented Feb 7, 2024 •

edited

tickx-cegeka commented Feb 12, 2024

etiennedi commented Feb 12, 2024

tickx-cegeka commented Feb 12, 2024 •

edited

chriscasola commented Apr 23, 2024

rthiiyer82 commented May 23, 2024

BM25 Index Corruption / BM25 not crash-resistant #4125

BM25 Index Corruption / BM25 not crash-resistant #4125

Comments

etiennedi commented Feb 2, 2024

How to reproduce this bug?

What is the expected behavior?

What is the actual behavior?

Supporting information

Server Version

Code of Conduct

etiennedi commented Feb 2, 2024 • edited

Known Mitigation techniques

Multi-Tenancy

Single-Tenancy

Known workarounds

Pure Vector Search instead of Hybrid

etiennedi commented Feb 3, 2024

etiennedi commented Feb 3, 2024

etiennedi commented Feb 3, 2024 • edited by rthiiyer82

Checklist for Rollout

Before Release

Release

Can happen after initial release

etiennedi commented Feb 4, 2024

tickx-cegeka commented Feb 7, 2024 • edited

tickx-cegeka commented Feb 12, 2024

etiennedi commented Feb 12, 2024

tickx-cegeka commented Feb 12, 2024 • edited

chriscasola commented Apr 23, 2024

rthiiyer82 commented May 23, 2024

etiennedi commented Feb 2, 2024 •

edited

etiennedi commented Feb 3, 2024 •

edited by rthiiyer82

tickx-cegeka commented Feb 7, 2024 •

edited

tickx-cegeka commented Feb 12, 2024 •

edited