BM25 returns no results in some situations (Original title: BM25/Tokenizer not working properly) #3517

netapy · 2023-09-08T18:53:29Z

Hi –

I have an Article class, which contains many French legal articles.

Here is the class definition:

class_obj = {
    "class": "Article",
    "description": "Articles des différentes codes de la loi.",
    "invertedIndexConfig": {
        "stopwords": {
            "preset": "none",
            #"additions": stopwords_fr
        }
    },
    "vectorizer": "text2vec-transformers",
    "properties": [
        {
            "name": "article_id",
            "description": "Id unique de l'article.",
            "dataType": [
                "text"
            ],
            "indexFilterable": True,
            "indexSearchable": False, 
        },
        {
            "name": "source",
            "description": "Titre de la source juridique (code, loi ou ordonnance) contenant l'article.",
            "dataType": [
                "text"
            ],
            "tokenization": "lowercase",
            "indexFilterable": True,
            "indexSearchable": True, 
        },
        {
            "name": "titre",
            "description": "Le titre de l'article.",
            "dataType": [
                "text"
            ],
            "indexFilterable": True,
            "indexSearchable": True, 
            "tokenization": "lowercase"
        },
        {
            "name": "texte",
            "description": "Le texte de l'article, en html.",
            "dataType": [
                "text"
            ],
            "moduleConfig": {
                "text2vec-transformers": {
                    "skip": False,
                    "vectorizePropertyName": False
                }
            },
            "indexFilterable": True,
            "indexSearchable": True, 
        },
        {
            "name": "etat",
            "description": "Etat de l'article : en vigueur, abrogé...",
            "dataType": [
                "text"
            ],
            "indexFilterable": True,
            "indexSearchable": False, 
        },
        {
            "name": "path_title",
            "description": "Chemin daccès à l'article",
            "dataType": [
                "text[]"
            ],
            "indexFilterable": True,
            "indexSearchable": True, 
        },
        {
            "name": "ref_textes",
            "description": "Références avec d'autres textes.",
            "dataType": [
                "text"
            ],
            "indexFilterable": True,
            "indexSearchable": False, 
        },
        {
            "name": "order",
            "description": "Ordre de l'article dans le code.",
            "dataType": [
                "int"
            ],
            "indexFilterable": True,
            "indexSearchable": False,
        },
        {
            "name": "date_deb",
            "description": "Date de début de l'article.",
            "dataType": [
                "int"
            ],
            "indexFilterable": True,
            "indexSearchable": False,
        },
        {
            "name": "date_fin",
            "description": "Date de fin de l'article.",
            "dataType": [
                "int"
            ],
            "indexFilterable": True,
            "indexSearchable": False,
        },
    ]
}

client.schema.create_class(class_obj)

I want to let my users search for legal articles through their "titre" and "source" properties, which have been tokenized using lowercase.
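To double-check which properties BM25 can search and how each one is tokenized, the class definition can be summarized with a small helper. This is only a sketch: `searchable_properties` is a hypothetical helper name, and it assumes that `indexSearchable` defaults to true and that a text property without an explicit `tokenization` uses `word`.

```python
from typing import Dict


def searchable_properties(class_obj: dict) -> Dict[str, str]:
    """Map each BM25-searchable text property to its tokenization."""
    result = {}
    for prop in class_obj.get("properties", []):
        data_type = prop.get("dataType", [])
        is_text = any(t in ("text", "text[]") for t in data_type)
        # Assumption: indexSearchable defaults to True when omitted.
        if is_text and prop.get("indexSearchable", True):
            # Assumption: "word" is the default tokenization for text.
            result[prop["name"]] = prop.get("tokenization", "word")
    return result


# Trimmed-down version of the class definition above.
example = {
    "class": "Article",
    "properties": [
        {"name": "source", "dataType": ["text"],
         "tokenization": "lowercase", "indexSearchable": True},
        {"name": "texte", "dataType": ["text"], "indexSearchable": True},
        {"name": "etat", "dataType": ["text"], "indexSearchable": False},
        {"name": "order", "dataType": ["int"], "indexSearchable": False},
    ],
}
print(searchable_properties(example))
```

Running the helper on the full `class_obj` would list the properties that a `bm25` query can target.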

Here is an example of an article I'm trying to find:

{
    "article_id": "JORFARTI000047663197",
    "etat": "VIGUEUR",
    "path_title": [
        "Titre IER : DE LA NATURE DE L'ACTIVITÉ D'INFLUENCE COMMERCIALE PAR VOIE ÉLECTRONIQUE ET DES OBLIGATIONS AFFÉRENTES À SON EXERCICE",
        "Chapitre III : Dispositions générales relatives à l'activité d'agent d'influenceur, aux contrats d'influence commerciale par voie électronique, à la responsabilité civile solidaire et à l'assurance civile professionnelle"
    ],
    "source": "LOI n° 2023-451 du 9 juin 2023 visant à encadrer l'influence commerciale et à lutter contre les dérives des influenceurs sur les réseaux sociaux (1)",
    "texte": ".....blablabla....",
    "titre": "7"
},

Using the following query:

query {
    Get {
        Article(
            limit: 5,
            bm25: {
                query: "LOI 9 juin 2023 visant à encadrer l'influence"
                properties: ["source^3", "titre"]
            }
        ) {
            article_id
            titre
            path_title
            texte
            source
            _additional {
                score
            }
        }
    }
}

...gives me the following results:

# only printing the "titre" and "source" in a list
[['2023', 'Code civil'], ["Annexe 9 à l'article A4241-50-2", 'Code des transports'], ["Annexe à l'article R*351-1, art. 9", 'Code des ports maritimes'], ['437 à 614-26', 'Code de commerce (ancien)'], ['L79 à L85', 'Code électoral']]

Am I missing something?
It seems like it should be able to find it, because I'm literally copy/pasting the exact source name.
Could it be an issue with the lowercase tokenizer?
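For reference, here is a rough sketch of how the two tokenizations would split the query and the source title. These are pure-Python approximations, not Weaviate's actual implementation: `tokenize_lowercase` assumes "lowercase" means lowercasing and splitting on whitespace, and `tokenize_word` assumes "word" means lowercasing and keeping runs of letters and digits. The source string is shortened for readability.

```python
import re


def tokenize_lowercase(text: str) -> list:
    # Approximation of "lowercase": lowercase, then split on whitespace only.
    return text.lower().split()


def tokenize_word(text: str) -> list:
    # Approximation of "word": lowercase, then keep runs of letters/digits.
    return re.findall(r"[^\W_]+", text.lower())


source = ("LOI n° 2023-451 du 9 juin 2023 visant à encadrer "
          "l'influence commerciale")
query = "LOI 9 juin 2023 visant à encadrer l'influence"

for tokenize in (tokenize_lowercase, tokenize_word):
    # Query tokens that do not occur anywhere in the source title.
    missing = set(tokenize(query)) - set(tokenize(source))
    print(tokenize.__name__, "missing:", sorted(missing))
```

Under both approximations every query token also appears in the source title, which suggests that tokenization alone would not explain the missing result.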

I'd be happy to provide you with further information if needed.

Thanks in advance.


donomii · 2023-09-08T23:26:47Z

Could you do the search with only the source property (remove the titre)?

e.g.

query {
    Get {
        Article(
            limit: 5,
            bm25: {
                query: "LOI 9 juin 2023 visant à encadrer l'influence"
                properties: ["source"]
            }
        ) {
            article_id
            titre
            path_title
            texte
            source
            _additional {
                score
            }
        }
    }
}

donomii · 2023-09-08T23:27:18Z

Also, if you raise the limit (e.g. to 100), do the correct results appear in the list?
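Both checks (searching only the source property and raising the limit) can be scripted. A minimal sketch follows; `build_bm25_query` is a hypothetical helper, not part of any client library, and it only builds the GraphQL string. It relies on JSON string and list literals being acceptable GraphQL literals for these inputs.

```python
import json


def build_bm25_query(query: str, properties: list, limit: int = 100) -> str:
    """Return a GraphQL string for a BM25 search on the Article class."""
    # json.dumps produces quoted/escaped literals for the query string
    # and for the list of property names.
    return (
        "{ Get { Article("
        f"limit: {limit}, "
        f"bm25: {{ query: {json.dumps(query, ensure_ascii=False)}, "
        f"properties: {json.dumps(properties)} }}"
        ") { article_id titre source _additional { score } } } }"
    )


print(build_bm25_query(
    "LOI 9 juin 2023 visant à encadrer l'influence", ["source"]))
```

The resulting string can then be sent to the GraphQL endpoint, for example through the raw-query method of the Python client if one is in use.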

netapy · 2023-09-09T05:45:29Z

@donomii Thanks for your answer.

> Could you do the search with only the source property (remove the titre)?

I just tried and, weirdly enough, it returns 0 results: it doesn't find anything (out of 300k+ articles).

> Also, if you raise the limit (e.g. to 100), do the correct results appear in the list?

The article doesn't show up.

It really seems that the source property has not been tokenized.

I might try with the default "word" tokenization later to see if it solves it. Will keep this updated

netapy · 2023-09-09T07:33:05Z

Also, for what it's worth, when I batch add my Articles I directly provide the vector.

netapy · 2023-09-09T10:54:57Z

> I might try with the default "word" tokenization later to see if it solves it. Will keep this updated

Doing this doesn't change anything – source is still not tokenized.

donomii · 2023-09-11T09:06:45Z

Is this a public dataset? Would I be able to download it and try to import it the same way you did?

Otherwise, could you give me some details about how much data, how many records, and the maximum size of the records?

netapy · 2023-09-11T16:27:15Z

> Is this a public dataset? Would I be able to download it and try to import it the same way you did?
>
> Otherwise, could you give me some details about how much data, how many records, and the maximum size of the records?

No. It is public data, but unfortunately it needs quite a bit of structuring.
Would it help if I sent you my Docker volume with all the articles?
Maybe over the Weaviate Slack.

MHilhorst · 2023-09-14T06:12:28Z

We are having the same issue. BM25 does not return any results even though the objects exist and we are querying the correct class. The problem seems to occur some time after a new object is uploaded. When a fresh object is uploaded, BM25 seems to work for a while.

donomii · 2023-09-14T08:31:07Z

We are currently investigating the problem as a priority, and we hope to have a solution soon.


donomii · 2023-09-16T12:01:59Z

A quick update - we have confirmed this bug and are still investigating it

narayanacharya6 · 2023-09-19T13:59:35Z

Can any details of the bug be shared and any possible workarounds (other than rewriting the data to Weaviate)?

makarsergeev · 2023-09-19T15:21:33Z

+1

PythonAlchemist · 2023-09-19T15:35:38Z

+1

algebrei · 2023-09-19T15:53:25Z

+1

tommykoctur · 2023-09-19T16:37:27Z

+1

pavelnemirovsky · 2023-09-19T21:39:31Z

+1

donomii · 2023-09-20T14:47:23Z

> Can any details of the bug be shared and any possible workarounds (other than rewriting the data to Weaviate)?

The bug appears to be index corruption, so your data is safe. We could add an option to re-index the data inside weaviate while we fix the bug.

narayanacharya6 · 2023-09-20T14:53:03Z

Thanks @donomii!

How frequently will reindexing be done? Does this impact performance for other requests being served by Weaviate?
Can this be released as a patch?

donomii · 2023-09-20T14:59:01Z

> Thanks @donomii!
>
> How frequently will reindexing be done? Does this impact performance for other requests being served by Weaviate?
>
> Can this be released as a patch?

Reindexing would be manual, and would certainly impact speed.
At this point I don't know how we would release it.

narayanacharya6 · 2023-09-20T16:51:56Z

Understood. This impacts our adoption of Weaviate, because we'd like to use the changes for the ContainsAny/ContainsAll operators in v1.21, but BM25, and in turn hybrid search, is unusable in these versions. Please keep us posted on when a fix or an option to manually re-index becomes available.

negon · 2023-09-27T07:30:57Z

The error is pretty consistent now.
The dataset I'm using has around 100k objects, each with a simple text and a number (artwork title and artwork inventory number).
If I drop the class and reindex everything, it works fine, but after about a day BM25 returns nothing.
I'm using 1.21.1, running locally.

krskrs · 2023-09-27T08:20:43Z

What sounds a bit weird to me is how such a central feature has not been fixed yet. Is everybody perhaps using Weaviate only as a vector DB? We chose it over its competitors precisely for its BM25 and hybrid support.

alex-cortenix · 2023-09-27T13:34:57Z

> What sounds a bit weird to me is how such a central feature has not been fixed yet. Maybe is everybody using weaviate only as a vector db? We had chosen it over its competitors just for its bm25 and hybrid support.

I second that. The main reason we chose Weaviate over its competitors was hybrid search. It seems the team was very busy releasing a new Python client (which, BTW, is irrelevant for us) and should now have more resources to fix this annoying issue...

etiennedi · 2023-09-27T16:51:27Z

Thanks, everyone for following up. It seems this issue is more common than we initially thought and we just got together to brainstorm how we can address this more urgently.

Thanks, @krskrs and @alex-cortenix, I agree and you raise good points. Weaviate's hybrid search is and should be a differentiating factor and we'll do everything in our power to make sure you have a great experience with BM25/Hybrid search going forward.

The main blocker on our side so far has been that we haven't been able to reliably reproduce the issue. All automated test pipelines are green and attempts to reproduce manually have been unsuccessful.

Therefore, I would like to ask for your help! Some of you seem to run into this issue reliably. Can you please help us get to that point as well? In particular, we need help with:

Any kind of script that assumes no previous state (i.e. starts with an empty Weaviate) and runs into this issue.
What are the time components of reproduction? Some users state that this seemingly breaks overnight, i.e. it's fine one day and broken the next without any deliberate changes.
Is there a specific version that makes this issue more likely? Does it go away when downgrading to a specific version? Is there a minimum version that's required for the issue to surface?

Thanks for your help, let's tackle this one together! The more input we can receive from your end, the faster we can provide a fix. Thank you!

etiennedi · 2023-09-28T01:22:38Z

Update

We can reliably reproduce the bug. A script to do so is attached below.
The bug appears from v1.21.0. All previous versions (e.g. v1.20.6) are unaffected. All later versions are affected.
The bug is likely related to compactions. Turning them off makes the bug disappear.

Version bisecting

v1.19.0 ✅
v1.20.0 ✅
v1.20.6 (latest 1.20 patch) ✅
v1.21.0 ❌
master ❌
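The bisecting above can be sketched as a binary search over an ordered version list. This is an illustration only: `first_bad_version` is a hypothetical helper, the `observed` results are simply the ones listed above, and the search assumes that every version after the first bad one is also bad.

```python
from typing import Callable, Optional, Sequence


def first_bad_version(
    versions: Sequence[str], is_bad: Callable[[str], bool]
) -> Optional[str]:
    """Binary search for the first bad version in an ordered list."""
    lo, hi = 0, len(versions)
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(versions[mid]):
            hi = mid       # first bad version is at mid or earlier
        else:
            lo = mid + 1   # first bad version is after mid
    return versions[lo] if lo < len(versions) else None


# Results from the list above: False = unaffected, True = affected.
observed = {
    "v1.19.0": False,
    "v1.20.0": False,
    "v1.20.6": False,
    "v1.21.0": True,
    "master": True,
}
print(first_bad_version(list(observed), lambda v: observed[v]))
```

In practice `is_bad` would start the given version and run the reproduction against it.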

Reproduction

1. Download the following book from Project Gutenberg in plain text form: https://www.gutenberg.org/ebooks/48871.txt.utf-8
2. Start up Weaviate with PERSISTENCE_FLUSH_IDLE_MEMTABLES_AFTER=3 (this is important because the script from step 3 uses a 5-second pause to force a new segment and therefore compactions).
3. Run this script. It should typically fail anywhere between iterations 5 and 9.
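The general shape of such a check can be sketched as follows. This is not the attached script: `find_corrupting_iteration`, `import_batch` and `bm25_count` are hypothetical names, the two callables stand in for whatever client calls are used, and the probe word is assumed to be present in the imported data from the first batch on.

```python
import time
from typing import Callable, Iterable, List, Optional


def find_corrupting_iteration(
    batches: Iterable[List[str]],
    import_batch: Callable[[List[str]], None],
    bm25_count: Callable[[str], int],
    probe_word: str,
    pause_seconds: float = 5.0,
) -> Optional[int]:
    """Return the iteration at which BM25 stops finding probe_word."""
    for iteration, batch in enumerate(batches, start=1):
        import_batch(batch)
        # Pause longer than the idle-memtable flush interval so that a
        # new segment is written and a compaction can take place.
        time.sleep(pause_seconds)
        if bm25_count(probe_word) == 0:
            return iteration
    return None


# Demo with in-memory stand-ins: the fake search "breaks" on call 3.
calls = {"n": 0}


def fake_count(word: str) -> int:
    calls["n"] += 1
    return 0 if calls["n"] >= 3 else 7


demo = find_corrupting_iteration(
    [["a"], ["b"], ["c"], ["d"]], lambda b: None, fake_count,
    "archive", pause_seconds=0)
print(demo)
```

With a real instance, `import_batch` would write objects to Weaviate and `bm25_count` would return the number of BM25 hits for the probe word.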

Once the script indicates a corruption, the state stays corrupted as long as you don't import any more objects. To prove this, we can look for a very common word, such as Archive.

As the following screenshot shows, Archive is a very common word:

However, it returns zero results in a BM25 search:

krskrs · 2023-09-28T14:06:46Z

Hi @etiennedi, that's great news! Unfortunately I don't have the time right now to test the script, but I can indeed confirm your points.

We started noticing this bug after upgrading to 1.21.
I suspected that it wasn't happening directly on write but during some sort of "offline/delayed" operation, because our current workflow consists of importing some data, updating it from time to time (we do this ourselves, so there are no updates from users), and querying it most of the time. I had noticed that it would suddenly stop working, but not immediately after the writes. We don't work with local Weaviate instances, only on Weaviate Cloud, so the flush interval is the default used on such instances.

etiennedi · 2023-09-30T15:46:11Z

Please see PR #3592 for a detailed explanation of what caused this, why it only surfaced in v1.21, and how it's fixed.

donomii self-assigned this Sep 16, 2023

etiennedi added the bug label Sep 19, 2023

etiennedi changed the title ~~BM25/Tokenizer not working properly ?~~ BM25 returns no results after a while (Original title: BM25/Tokenizer not working properly) Sep 19, 2023

etiennedi changed the title ~~BM25 returns no results after a while (Original title: BM25/Tokenizer not working properly)~~ BM25 returns no results in some situations (Original title: BM25/Tokenizer not working properly) Sep 19, 2023

donomii removed their assignment Sep 27, 2023

donomii self-assigned this Sep 27, 2023

donomii assigned parkerduckworth and unassigned donomii Sep 28, 2023

etiennedi mentioned this issue Sep 30, 2023

Fix issue where BM25 would sometimes return no results after a compaction #3592

Merged

4 tasks

antas-marcin closed this as completed in b5faba1 Oct 3, 2023
