BM25 returns no results in some situations (Original title: BM25/Tokenizer not working properly) #3517

Closed
netapy opened this issue Sep 8, 2023 · 27 comments
@netapy

netapy commented Sep 8, 2023

Hi –

I have an Article class, which contains many french legal articles.

Here is the class definition:

class_obj = {
    "class": "Article",
    "description": "Articles des différentes codes de la loi.",
    "invertedIndexConfig": {
        "stopwords": {
            "preset": "none",
            #"additions": stopwords_fr
        }
    },
    "vectorizer": "text2vec-transformers",
    "properties": [
        {
            "name": "article_id",
            "description": "Id unique de l'article.",
            "dataType": [
                "text"
            ],
            "indexFilterable": True,
            "indexSearchable": False, 
        },
        {
            "name": "source",
            "description": "Titre de la source juridique (code, loi ou ordonnance) contenant l'article.",
            "dataType": [
                "text"
            ],
            "tokenization": "lowercase",
            "indexFilterable": True,
            "indexSearchable": True, 
        },
        {
            "name": "titre",
            "description": "Le titre de l'article.",
            "dataType": [
                "text"
            ],
            "indexFilterable": True,
            "indexSearchable": True, 
            "tokenization": "lowercase"
        },
        {
            "name": "texte",
            "description": "Le texte de l'article, en html.",
            "dataType": [
                "text"
            ],
            "moduleConfig": {
                "text2vec-transformers": {
                    "skip": False,
                    "vectorizePropertyName": False
                }
            },
            "indexFilterable": True,
            "indexSearchable": True, 
        },
        {
            "name": "etat",
            "description": "Etat de l'article : en vigueur, abrogé...",
            "dataType": [
                "text"
            ],
            "indexFilterable": True,
            "indexSearchable": False, 
        },
        {
            "name": "path_title",
            "description": "Chemin d'accès à l'article",
            "dataType": [
                "text[]"
            ],
            "indexFilterable": True,
            "indexSearchable": True, 
        },
        {
            "name": "ref_textes",
            "description": "Références avec d'autres textes.",
            "dataType": [
                "text"
            ],
            "indexFilterable": True,
            "indexSearchable": False, 
        },
        {
            "name": "order",
            "description": "Ordre de l'article dans le code.",
            "dataType": [
                "int"
            ],
            "indexFilterable": True,
            "indexSearchable": False,
        },
        {
            "name": "date_deb",
            "description": "Date de début de l'article.",
            "dataType": [
                "int"
            ],
            "indexFilterable": True,
            "indexSearchable": False,
        },
        {
            "name": "date_fin",
            "description": "Date de fin de l'article.",
            "dataType": [
                "int"
            ],
            "indexFilterable": True,
            "indexSearchable": False,
        },
    ]
}

client.schema.create_class(class_obj)

I want to let my users search for legal articles through their "titre" and "source" properties, which have been tokenized using lowercase.

Here is an example of an article I'm trying to find :

{
    "article_id": "JORFARTI000047663197",
    "etat": "VIGUEUR",
    "path_title": [
        "Titre IER : DE LA NATURE DE L'ACTIVITÉ D'INFLUENCE COMMERCIALE PAR VOIE ÉLECTRONIQUE ET DES OBLIGATIONS AFFÉRENTES À SON EXERCICE",
        "Chapitre III : Dispositions générales relatives à l'activité d'agent d'influenceur, aux contrats d'influence commerciale par voie électronique, à la responsabilité civile solidaire et à l'assurance civile professionnelle"
    ],
    "source": "LOI n° 2023-451 du 9 juin 2023 visant à encadrer l'influence commerciale et à lutter contre les dérives des influenceurs sur les réseaux sociaux (1)",
    "texte": ".....blablabla....",
    "titre": "7"
},

Using the following query :

query {
    Get {
    Article(
        limit: 5,
        bm25: {
            query: "LOI 9 juin 2023 visant à encadrer l'influence"
            properties: ["source^3", "titre"]
        }
        ) {
            article_id
            titre
            path_title
            texte
            source
            _additional {
                score
            }                
        }
    }
}

...gives me the following results :

# only printing the "titre" and "source" in a list
[['2023', 'Code civil'], ["Annexe 9 à l'article A4241-50-2", 'Code des transports'], ["Annexe à l'article R*351-1, art. 9", 'Code des ports maritimes'], ['437 à 614-26', 'Code de commerce (ancien)'], ['L79 à L85', 'Code électoral']]

Am I missing something?
It seems like it should be able to find it, because I'm literally copy/pasting the exact source name.
Could it be an issue with the lowercase tokenizer?
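
To sanity-check the tokenizer hypothesis without touching the server, the two tokenization modes can be approximated locally. This is only a rough sketch: `tokenize_lowercase` and `tokenize_word` are hypothetical helpers that mimic the documented behaviour (lowercase and split on whitespace, versus lowercase and split on non-alphanumeric characters), and Weaviate's real tokenizers may differ in edge cases.

```python
import re


def tokenize_lowercase(text: str) -> list[str]:
    """Approximation of 'lowercase' tokenization: lowercase, split on whitespace only."""
    return text.lower().split()


def tokenize_word(text: str) -> list[str]:
    """Approximation of 'word' tokenization: lowercase, split on non-alphanumerics."""
    return re.findall(r"[^\W_]+", text.lower())


# Values copied from the example article and the query above.
source = (
    "LOI n° 2023-451 du 9 juin 2023 visant à encadrer l'influence commerciale "
    "et à lutter contre les dérives des influenceurs sur les réseaux sociaux (1)"
)
query = "LOI 9 juin 2023 visant à encadrer l'influence"

# Query tokens that do not occur in the source under 'lowercase' tokenization.
source_tokens = set(tokenize_lowercase(source))
missing = [t for t in tokenize_lowercase(query) if t not in source_tokens]
print(missing)
```

Under this approximation every query token also occurs in the source, so tokenization alone would not be expected to explain a missing hit.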

I'd be happy to provide you with further information if needed.

Thanks in advance.

@donomii
Contributor

donomii commented Sep 8, 2023

Could you do the search with only the source property (remove the titre)?

e.g.

query {
    Get {
        Article(
            limit: 5,
            bm25: {
                query: "LOI 9 juin 2023 visant à encadrer l'influence"
                properties: ["source"]
            }
        ) {
            article_id
            titre
            path_title
            texte
            source
            _additional {
                score
            }
        }
    }
}

@donomii
Contributor

donomii commented Sep 8, 2023

Also, if you raise the limit (e.g. to 100), do the correct results appear in the list?
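
Both diagnostics (a single property, a higher limit) are easy to script. The sketch below builds the GraphQL query as a string and can post it to Weaviate's `/v1/graphql` endpoint; `build_bm25_query`, `run_graphql` and the base URL in the usage note are illustrative assumptions, not part of any client library.

```python
import json
import urllib.request


def build_bm25_query(class_name, query, properties, fields, limit=5):
    """Return a GraphQL Get query string with a bm25 clause (hypothetical helper)."""
    # json.dumps produces double-quoted, escaped literals that GraphQL accepts.
    query_literal = json.dumps(query, ensure_ascii=False)
    properties_literal = json.dumps(properties)
    return (
        "{ Get { %s(limit: %d, bm25: {query: %s, properties: %s}) "
        "{ %s _additional { score } } } }"
        % (class_name, limit, query_literal, properties_literal, " ".join(fields))
    )


def run_graphql(base_url, graphql):
    """POST the query to <base_url>/v1/graphql and return the decoded JSON response."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/v1/graphql",
        data=json.dumps({"query": graphql}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


graphql = build_bm25_query(
    "Article",
    "LOI 9 juin 2023 visant à encadrer l'influence",
    properties=["source"],
    fields=["article_id", "titre", "source"],
    limit=100,
)
print(graphql)
```

Against a local instance, something like `run_graphql("http://localhost:8080", graphql)` would execute it; the address is an assumption about the deployment.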

@netapy
Author

netapy commented Sep 9, 2023

@donomii Thanks for your answer.

> Could you do the search with only the source property (remove the titre)?

I just tried, and weirdly enough it gives me 0 results; it doesn't find anything (out of 300k+ articles).

> Also, if you raise the limit (e.g. to 100), do the correct results appear in the list?

It doesn't show up.

It really seems that the source property has not been tokenized.

I might try with the default "word" tokenization later to see if it solves it. Will keep this updated

@netapy
Author

netapy commented Sep 9, 2023

Also, for what it's worth, when I batch add my Articles I directly provide the vector.

@netapy
Author

netapy commented Sep 9, 2023

> I might try with the default "word" tokenization later to see if it solves it. Will keep this updated

Doing this doesn't change anything – source is still not tokenized.

@donomii
Contributor

donomii commented Sep 11, 2023

Is this a public dataset? Would I be able to download it and try to import it the same way you did?

Otherwise, could you give me some details about how much data, how many records, and the maximum size of the records?

@netapy
Author

netapy commented Sep 11, 2023

> Is this a public dataset? Would I be able to download it and try to import it the same way you did?
>
> Otherwise, could you give me some details about how much data, how many records, and the maximum size of the records?

No. It is public data, but unfortunately it needs quite a bit of structuring.
Would it help if I sent you my Docker volume with all the articles?
Maybe over the Weaviate Slack.

@MHilhorst

We are having the same issue. BM25 does not return any results even though the objects exist and we are querying the correct class. The problem seems to occur some time after a new object is uploaded. When a fresh object is uploaded, BM25 seems to work for some time.

@donomii
Contributor

donomii commented Sep 14, 2023 via email

@donomii
Contributor

donomii commented Sep 16, 2023

A quick update - we have confirmed this bug and are still investigating it

@donomii donomii self-assigned this Sep 16, 2023
@etiennedi etiennedi added the bug label Sep 19, 2023
@etiennedi etiennedi changed the title from "BM25/Tokenizer not working properly ?" to "BM25 returns no results after a while (Original title: BM25/Tokenizer not working properly)" Sep 19, 2023
@etiennedi etiennedi changed the title from "BM25 returns no results after a while (Original title: BM25/Tokenizer not working properly)" to "BM25 returns no results in some situations (Original title: BM25/Tokenizer not working properly)" Sep 19, 2023
@narayanacharya6

Can any details of the bug be shared and any possible workarounds (other than rewriting the data to Weaviate)?

@makarsergeev

+1

@PythonAlchemist

+1

@algebrei

+1

@tommykoctur

+1

@pavelnemirovsky

+1

@donomii
Contributor

donomii commented Sep 20, 2023

> Can any details of the bug be shared and any possible workarounds (other than rewriting the data to Weaviate)?

The bug appears to be index corruption, so your data is safe. We could add an option to re-index the data inside weaviate while we fix the bug.

@narayanacharya6

Thanks @donomii!

  1. How frequently will reindexing be done? Does this impact performance for other requests being served by Weaviate?
  2. Can this be released as a patch?

@donomii
Contributor

donomii commented Sep 20, 2023

> Thanks @donomii!
>
>   1. How frequently will reindexing be done? Does this impact performance for other requests being served by Weaviate?
>   2. Can this be released as a patch?

  1. Reindexing would be manual, and would certainly impact speed.
  2. At this point I don't know how we would release it.

@narayanacharya6

Understood. This impacts our adoption of Weaviate because we'd like to use the changes for the ContainsAny/ContainsAll operators in v1.21, but BM25 and, in turn, hybrid search are unusable in these versions. Please keep us posted on when a fix or an option to manually re-index is available.

@negon

negon commented Sep 27, 2023

The error is pretty consistent now.
The dataset I'm using is around 100k simple texts and a number (artwork titles and artwork inventory number).
If I drop the class and reindex everything it's working fine, but after about a day the bm25 returns nothing.
I'm using 1.21.1, running locally.

@krskrs

krskrs commented Sep 27, 2023

What sounds a bit weird to me is that such a central feature has not been fixed yet. Maybe everybody is using Weaviate only as a vector DB? We chose it over its competitors precisely for its BM25 and hybrid support.

@donomii donomii removed their assignment Sep 27, 2023
@alex-cortenix

> What sounds a bit weird to me is how such a central feature has not been fixed yet. Maybe is everybody using weaviate only as a vector db? We had chosen it over its competitors just for its bm25 and hybrid support.

I second that; the main factor in choosing Weaviate over its competitors was hybrid search. It seems the team was very busy releasing a new Python client (which, BTW, is irrelevant for us) and should now have more resources to fix this annoying issue...

@etiennedi
Member

Thanks, everyone for following up. It seems this issue is more common than we initially thought and we just got together to brainstorm how we can address this more urgently.

Thanks, @krskrs and @alex-cortenix, I agree and you raise good points. Weaviate's hybrid search is and should be a differentiating factor and we'll do everything in our power to make sure you have a great experience with BM25/Hybrid search going forward.

The main blocker on our side so far has been that we haven't been able to reliably reproduce the issue. All automated test pipelines are green and attempts to reproduce manually have been unsuccessful.

Therefore, I would like to ask for your help! Some of you seem to run into this issue reliably. Can you please help us get to that point as well? In particular, we need help with:

  • Any kind of script that assumes no previous state (i.e. starts with an empty Weaviate) and runs into this issue?
  • What are the time components of reproduction? We have some users stating that this seemingly breaks overnight; i.e. it's fine one day and broken the next without any deliberate changes
  • Is there a specific version that makes this issue more likely? Does it go away when downgrading to a specific version? Is there a minimum version that's required for the issue to surface?

Thanks for your help, let's tackle this one together! The more input we can receive from your end, the faster we can provide a fix. Thank you!

@donomii donomii self-assigned this Sep 27, 2023
@etiennedi
Member

Update

  1. We can reliably reproduce the bug. A script to do so is attached below.
  2. The bug appears from v1.21.0. All previous versions (e.g. v1.20.6) are unaffected. All later versions are affected.
  3. The bug is likely related to compactions. Turning them off makes the bug disappear.

Version bisecting

  • v1.19.0
  • v1.20.0
  • v1.20.6 (latest 1.20 patch) ✅
  • v1.21.0
  • master

Reproduction

  1. Download the following book from project Gutenberg in plain text form: https://www.gutenberg.org/ebooks/48871.txt.utf-8
  2. Start up Weaviate with PERSISTENCE_FLUSH_IDLE_MEMTABLES_AFTER=3 (<-- this is important because the script from step 3 uses a 5 second pause to force a new segment and therefore compactions)
  3. Run this script. It should typically fail anywhere between iteration 5 and 9
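
The attached script is not reproduced in this thread. Purely as an illustration of the loop the steps describe (import, pause long enough for the memtable to be flushed into a new segment, then check BM25), here is a minimal sketch. `find_first_corrupt_iteration` and its two callables are hypothetical; the actual import and search calls against Weaviate are deliberately left to the caller.

```python
import time
from typing import Callable, Optional


def find_first_corrupt_iteration(
    import_batch: Callable[[int], None],
    bm25_hit_count: Callable[[], int],
    max_iterations: int = 20,
    pause_seconds: float = 5.0,
) -> Optional[int]:
    """Import, pause, check; return the first iteration where BM25 finds nothing.

    import_batch(i)   -- caller-supplied: imports the i-th batch of objects.
    bm25_hit_count()  -- caller-supplied: number of BM25 hits for a term that is
                         known to be present in the imported data.
    """
    for iteration in range(1, max_iterations + 1):
        import_batch(iteration)
        # The pause must exceed PERSISTENCE_FLUSH_IDLE_MEMTABLES_AFTER so that
        # each iteration produces a new segment and compactions can kick in.
        time.sleep(pause_seconds)
        if bm25_hit_count() == 0:
            return iteration
    return None
```

Passing the import and query steps in as functions keeps the loop independent of any particular client version, and lets it be exercised with stand-in functions before pointing it at a real instance.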

Once the script indicates a corruption, the state stays corrupted as long as you don't import any more objects. To prove this, we can look for a very common word, such as Archive.

As the following screenshot shows, Archive is a very common word:

[Screenshot 2023-09-27 at 5:09:28 PM]

However, it returns zero results in a BM25 search:

[Screenshot 2023-09-27 at 5:08:18 PM]

@krskrs

krskrs commented Sep 28, 2023

Hi @etiennedi, that's great news! Unfortunately I don't have the time right now to test the script, but I can confirm your points.

  • we started noticing this bug after upgrading to 1.21
  • I suspected that it wasn't happening directly on write but during some sort of "offline/delayed" operation, because our current workflow consists of importing some data, updating it from time to time (we do this ourselves, so there are no updates from users), and querying it most of the time. I had noticed that it would suddenly stop working, but not immediately after the writes. We don't work with local Weaviate instances, only on Weaviate Cloud, so the flush interval is the default used on such instances.

@etiennedi
Member

Please see PR #3592 for a detailed explanation of what caused this, why it only surfaced in v1.21, and how it's fixed.
