Add _snippets in GraphQL and REST API #1159
Note: the JSON result misses the …
Great idea, some thoughts:
I also like the idea. I can build a small prototype next week.
My suggestion would be …
An alternative that we should also consider and validate against:
Advantages
Disadvantages
@bobvanluijt regarding the suggested implementation: assume the results contain 10 articles of 1,000 words each. If we do this per word, that would be 10,000 vectorizations (minus potential duplicates) plus 10,000 vector comparisons (minus potential duplicates). This might be (needs evaluation) too expensive for query time. For an alternative approach that offloads some of the cost to index time, see the previous post.
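To make the concern concrete, here is a back-of-the-envelope cost model for the per-word approach. The per-operation timings are illustrative assumptions, not measurements:

```python
# Rough cost model for per-word snippet search at query time.
# VECTORIZE_US and COMPARE_US are assumed costs, not benchmarks.

NUM_ARTICLES = 10
WORDS_PER_ARTICLE = 1000

VECTORIZE_US = 50  # assumed cost to vectorize one word, in microseconds
COMPARE_US = 1     # assumed cost of one vector distance comparison

vectorizations = NUM_ARTICLES * WORDS_PER_ARTICLE  # minus duplicates in practice
comparisons = NUM_ARTICLES * WORDS_PER_ARTICLE

total_ms = (vectorizations * VECTORIZE_US + comparisons * COMPARE_US) / 1000
print(f"{vectorizations} vectorizations + {comparisons} comparisons "
      f"≈ {total_ms:.0f} ms at query time")
```

Even with these optimistic assumed costs, the query-time budget is dominated by vectorization, which matches the measurements reported later in this thread.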
Couldn't we do a binary search? This would cut our search from O(n) to O(log n).
What we can (relatively easily*) do is record the word occurrence position in the inverted index (I think Lucene does the same), but that would again only help for exact matches, not for vector-similarity matches.

* = once we have the custom DB strategy
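A minimal sketch of what recording occurrence positions in the inverted index could look like. The function name and structure are hypothetical, not Weaviate's actual index code:

```python
from collections import defaultdict

def build_positional_index(text: str) -> dict:
    """Map each token to the list of word positions where it occurs."""
    index = defaultdict(list)
    for pos, token in enumerate(text.lower().split()):
        index[token].append(pos)
    return dict(index)

doc = "the quick brown fox jumps over the lazy dog"
index = build_positional_index(doc)
print(index["the"])  # → [0, 6]
```

As noted above, this only helps for exact matches: a lookup for a vector-similar word (e.g. "rapid" when the document says "quick") would return nothing.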
I thought about something like this:
I like the idea, but it definitely requires validation. I'm not sure, if you split a 1,000-word article into two 500-word articles, whether a single sentence influences the vector position enough. But definitely worth a shot. It would indeed reduce the number of comparisons (but not the number of vectorizations).
So far I ran a test for 4 different queries.

Linear search (word by word): Speeds: …
Divide and search: …
Predefined snippets: Speeds: …

Conclusion: Divide and search can be a valid strategy compared to predefined snippets. However, all of the tests break the 100 ms mark when using more text. How should we proceed?
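For reference, the divide-and-search strategy under test can be sketched as follows. This is a toy illustration: the bag-of-words `embed` is a stand-in for real vectorization (e.g. the contextionary), and all names are hypothetical:

```python
from collections import Counter
import math

def embed(text: str) -> Counter:
    """Stand-in for a real vectorizer: bag-of-words term counts."""
    return Counter(text.lower().split())

def similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) \
        * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def divide_and_search(words, query_vec, min_len=8):
    """Recursively keep whichever half of the text is closer to the query."""
    if len(words) <= min_len:
        return " ".join(words)
    mid = len(words) // 2
    halves = (words[:mid], words[mid:])
    best = max(halves, key=lambda h: similarity(embed(" ".join(h)), query_vec))
    return divide_and_search(best, query_vec, min_len)

text = ("weaviate stores objects and vectors together "
        "a pyramid scheme recruits members via a promise of payments "
        "the weather today is sunny with a light breeze")
snippet = divide_and_search(text.split(), embed("is this a pyramid scheme"))
print(snippet)  # → a pyramid scheme recruits members via
```

Each recursion level halves the candidate text, so the number of comparisons is O(log n) in the word count, but every level still vectorizes both halves, which is where the time goes.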
@etiennedi please note that my spike does it per sentence and not per word ;-)
The time is only the search for the text snippet, aka the time a function … The linear search took 5.7 seconds to execute on the text corpus. The predefined-snippets test is equal to what Bob did before.
Cool. The reason I predefined is to not do it on a word-for-word basis but based on complete sentences, to significantly speed up the process.
Thanks a lot @fefi42 for the investigation so far. I am surprised by how slow this is. At the moment we are orders of magnitude away from this being feasible for implementation. We need to get into the 200 ms (better 100 ms) range on an average query for this to be viable, otherwise it'll be an operations nightmare under load. We don't have a definition of what an average query is, but a length of 4 seems very low to me. I'd assume "normal" is more in the range of 10-20. So a single article would be allowed to take 20 ms max, without incorporating the actual query time yet. @fefi42 was the result quality between divide & search and linear comparable? If you have more time, could you try:
If we cannot get this into a feasible response time, I see these possible alternatives:
The vectorization costs the most time. For the divide strategy the speed test looks like this:

For 98 words: init 182ns, 69.568684ms
For 292 words: init 171ns, 118.742434ms
For 2709 words: init 87ns, 730.08619ms

Vectorizing on the fly takes almost all of the time.
Thanks.
That's a pretty strong argument to do this outside of Weaviate and use Weaviate's schema as a helper to achieve the same effect as outlined in #1159 (comment)
I do understand this argument, but ideally, it would be part of Weaviate during searching (or vectorizing at import time, e.g., if the dataType = …).
The above shows that search time is not really an option. Index time, however, would certainly be possible: essentially, instead of storing one vector per item, we would store n vectors per item, where n depends on the split criterion (e.g. word, sentence, paragraph). Since that's potentially a lot of additional vectors, we should probably make this optional. :-) This is, however, a bit of a bigger change as it fundamentally changes how we store things, so I'd still recommend implementing this outside of Weaviate* for the first use case, as I can't give a reliable estimate and ETA on this change.

* = it's not really outside so much as on top of Weaviate
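The index-time variant described above could be sketched like this: store one vector per sentence of each object, so query time only pays for one vectorization plus comparisons. The class, its methods, and the bag-of-words stand-in for real vectors are all hypothetical:

```python
from collections import Counter
import math

def embed(text):
    """Stand-in vectorizer: bag-of-words counts (a real setup would use model vectors)."""
    return Counter(text.lower().replace(".", "").split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) \
        * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class MultiVectorIndex:
    """Toy index: n vectors per object, one per sentence."""
    def __init__(self):
        self.entries = []  # (object_id, sentence, vector)

    def add(self, object_id, text):
        # Index time: vectorize every sentence once, up front.
        for sentence in (s.strip() for s in text.split(".") if s.strip()):
            self.entries.append((object_id, sentence, embed(sentence)))

    def best_snippet(self, query):
        # Query time: one vectorization plus len(self.entries) comparisons.
        q = embed(query)
        obj, sentence, _ = max(self.entries, key=lambda e: cosine(e[2], q))
        return obj, sentence

index = MultiVectorIndex()
index.add("article-1", "The stock market fell sharply. Analysts blame rate hikes.")
index.add("article-2", "The local bakery won an award. Its sourdough is famous.")
print(index.best_snippet("what happened to the stock market"))
```

The storage trade-off is visible directly: `entries` grows with the number of sentences rather than the number of objects, which is why making this optional seems sensible.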
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Still a feature I would love to see in the not-too-distant future, hence the un-stale message.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
When an object is indexed on a larger text item (e.g., a paragraph like in the news article demo), certain search terms can be found in sentences. The idea is to add a starting point and endpoint of the most important part of the text corpus in the `_meta` endpoint as a `potentialAnswer`, which can be enabled or disabled by setting a `distanceToAnswer`. This can work both for `explore` filters and `where` filters.

Idea

I was searching for something on Wikipedia under the search term: "Is herbalife a pyramid scheme?" and got this response. Because Google isn't giving the actual answer but a location for the answer, we should be able to calculate something similar.

Explore example

Result where the `start` and `end` give the starting and ending position, and in which `property` the answer / most important part can be found.

Where example

Result where the `start` and `end` give the starting and ending position, and in which `property` the answer / most important part can be found.

Suggested (first) implementation

`start` and `end` point are found at the beginning and end of the sentence. `distanceToAnswer` is the minimal distance; if it is not set, no start and end points will be available; if multiple sentences make the mark, they will all be part of the array.

* = there might be potential to also do this on groups of words or complete sentences.
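A rough sketch of how the proposed `start`/`end`/`distanceToAnswer` fields could be computed per property. The function name, the bag-of-words stand-in for real vectors, and the threshold semantics are assumptions for illustration, not the actual implementation:

```python
import math
import re
from collections import Counter

def embed(text):
    # Stand-in for a real vectorizer: bag-of-words term counts.
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))

def cosine_distance(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) \
        * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm if norm else 1.0

def potential_answers(prop_text, query, distance_to_answer=0.7):
    """Return start/end character offsets for each sentence within the distance threshold."""
    q = embed(query)
    answers = []
    for match in re.finditer(r"[^.!?]+[.!?]?", prop_text):
        sentence = match.group().strip()
        if not sentence:
            continue
        distance = cosine_distance(embed(sentence), q)
        if distance <= distance_to_answer:
            answers.append({
                "start": match.start(),  # may include leading whitespace in this toy version
                "end": match.end(),
                "distanceToAnswer": round(distance, 3),
            })
    return answers

text = "Weaviate is a vector search engine. It stores data objects and vectors. The weather is nice."
print(potential_answers(text, "what is a vector search engine"))
```

If no sentence clears the threshold, the array is simply empty, and multiple qualifying sentences each get their own entry, matching the "all part of the array" behavior suggested above.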
Related
#1136 #1139 #1155 #1156