BM25 returns no results in some situations (Original title: BM25/Tokenizer not working properly) #3517
Comments
Could you do the search with only the source property (remove the titre)? e.g. query { |
Also, if you raise the limit (e.g. to 100), do the correct results appear in the list? |
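For anyone who wants to try both suggestions at once, here is a minimal sketch that builds such a query as a string. It assumes Weaviate's GraphQL `bm25` operator with its `query` and `properties` arguments; the helper name `build_bm25_query` and the search text `"Code civil"` are made up for illustration and are not taken from this thread.

```python
import json


def build_bm25_query(class_name, text, properties, limit, fields):
    """Build a GraphQL query string for a BM25 search on selected properties."""
    # json.dumps produces double-quoted, escaped literals that GraphQL accepts.
    bm25_args = "query: {}, properties: {}".format(
        json.dumps(text, ensure_ascii=False),
        json.dumps(properties),
    )
    return (
        "{ Get { "
        + f"{class_name}(bm25: {{{bm25_args}}}, limit: {limit}) "
        + "{ " + " ".join(fields) + " } } }"
    )


query = build_bm25_query(
    class_name="Article",
    text="Code civil",          # placeholder search text, not from the thread
    properties=["source"],      # search only `source`, leaving out `titre`
    limit=100,                  # raised limit, as suggested above
    fields=["source", "titre"],
)
print(query)
```

The resulting string can be pasted into the GraphQL console or sent with any client; if the expected article shows up only with the higher limit, the problem is ranking rather than matching.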
@donomii Thanks for your answer.
I just tried, and weirdly enough it just gives me 0 results – doesn't find anything (out of +300k articles)
It doesn't show up. It really seems that the source property has not been tokenized. I might try with the default "word" tokenization later to see if it solves it. Will keep this updated |
Also, for what it's worth, when I batch add my Articles I directly provide the vector. |
Doing this doesn't change anything – source is still not tokenized. |
Is this a public dataset? Would I be able to download it and try to import it the same way you did? Otherwise, could you give me some details about how much data, how many records, and the maximum size of the records? |
No, it is public data, but unfortunately it needs quite a bit of structuring. |
We are having the same issue. BM25 does not return any results even though the objects exist and we are querying the correct class. The problem seems to occur some time after a new object is uploaded. When a fresh object is uploaded, BM25 seems to work for some time. |
We are currently investigating the problem as a priority, and we hope to
have a solution soon.
(In reply to Michael Hilhorst's comment of Thu, 14 Sept 2023, 08:12.) |
A quick update - we have confirmed this bug and are still investigating it |
Can any details of the bug be shared and any possible workarounds (other than rewriting the data to Weaviate)? |
+1 |
The bug appears to be index corruption, so your data is safe. We could add an option to re-index the data inside weaviate while we fix the bug. |
Thanks @donomii!
Understood. This impacts our adoption of Weaviate because we'd like to use the changes for |
The error is pretty consistent now. |
What seems a bit odd to me is that such a central feature has not been fixed yet. Maybe everybody is using Weaviate only as a vector DB? We chose it over its competitors precisely for its BM25 and hybrid support. |
I second that: the main reason we chose Weaviate over its competitors was hybrid search. It seems the team was very busy releasing a new Python client (which, BTW, is irrelevant for us) and should now have more resources to fix this annoying issue... |
Thanks, everyone, for following up. It seems this issue is more common than we initially thought, and we just got together to brainstorm how we can address it more urgently. Thanks, @krskrs and @alex-cortenix; I agree, and you raise good points. Weaviate's hybrid search is and should be a differentiating factor, and we'll do everything in our power to make sure you have a great experience with BM25/Hybrid search going forward.
The main blocker on our side so far has been that we haven't been able to reliably reproduce the issue. All automated test pipelines are green, and attempts to reproduce it manually have been unsuccessful.
Therefore, I would like to ask for your help! Some of you seem to run into this issue reliably. Can you please help us get to that point as well? In particular, we need help with:
Thanks for your help, let's tackle this one together! The more input we can receive from your end, the faster we can provide a fix. Thank you! |
Update
Version bisecting
Reproduction
Once the script indicates a corruption, the state stays corrupted as long as you don't import any more objects. To prove this, we can look for a very common word that is known to be in the data: in the corrupted state, a BM25 search for it returns zero results. |
Hi @etiennedi that's great news! Unfortunately I haven't the time at this exact moment to test the script, but I can indeed confirm your points.
|
Please see PR #3592 for a detailed explanation of what caused this, why it only surfaced in v1.21, and how it's fixed. |
Hi –
I have an Article class, which contains many French legal articles.
Here is the class definition:
I want to let my users search for legal articles through their "titre" and "source" properties, which have been tokenized using lowercase.
Here is an example of an article I'm trying to find:
Using the following query:
...gives me the following results:
Am I missing something?
It seems like it should be able to find it, because I'm literally copy/pasting the exact source name.
Could it be an issue with the lowercase tokenizer?
I'd be happy to provide you with further information if needed.
Thanks in advance.
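To make the tokenizer question concrete, here is a rough pure-Python approximation of how the four tokenization options for text properties (`word`, `lowercase`, `whitespace`, `field`) split a value, as far as the documented behaviour goes. It is a sketch for comparison only, not Weaviate's actual implementation, and the sample string is made up rather than taken from the data in this issue.

```python
import re


def tokenize(text, mode):
    """Approximate the four tokenization options for text properties."""
    if mode == "word":
        # Split on anything that is not a letter or digit, then lowercase.
        return [t.lower() for t in re.findall(r"[^\W_]+", text)]
    if mode == "lowercase":
        # Split on whitespace only, then lowercase; punctuation stays attached.
        return text.lower().split()
    if mode == "whitespace":
        # Split on whitespace only; case is preserved.
        return text.split()
    if mode == "field":
        # Index the whole trimmed value as a single token.
        return [text.strip()]
    raise ValueError(f"unknown tokenization mode: {mode}")


sample = "Code civil, Article 1240"  # made-up value, not taken from the thread
for mode in ("word", "lowercase", "whitespace", "field"):
    print(f"{mode:>10}: {tokenize(sample, mode)}")
```

Under this approximation, `lowercase` keeps the comma attached to `civil,`, so a query token `civil` would not match that particular token. That is a property of the tokenizer and is separate from the index corruption discussed in the comments.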