
Add _snippets in GraphQL and REST API #1159

Closed
bobvanluijt opened this issue Jun 5, 2020 · 22 comments
Labels
autoclosed Closed by the bot. We still want this, but it didn't quite make the latest prioritization round Contextionary graphql _underscoreProp

Comments

@bobvanluijt
Member

When an object is indexed on a larger text item (e.g., a paragraph like in the news article demo) certain search terms can be found in sentences.

The idea is to add the start and end point of the most important part of the text corpus in the __meta end-point as a potentialAnswer, which can be enabled or disabled by setting a distanceToAnswer. This can work for both explore filters and where filters.

Idea

I was searching for something on Wikipedia under the search term "Is herbalife a pyramid scheme?" and got this response.

Because Google isn't giving the actual answer but a location for the answer, we should be able to calculate something similar.

[Screenshot: Google search result showing the location of the answer (2020-06-05)]

Explore example

{
  Get{
    Things{
      Article(
        explore: {
          concepts: ["I want a spare rib"],
          certainty: 0.7,
          moveAwayFrom: {
            concepts: ["bacon"],
            force: 0.45
          }
        }
      ){
        name
        __meta {
          potentialAnswer(
            distanceToAnswer: 0.5 # <== optional
          ) {
            start
            end
            property
            distanceToQuery
          }
        }
      }
    }
  }
}

Result, where start and end give the starting and ending positions, and property indicates the property in which the answer / most important part can be found.

{
    "data": {
        "Get": {
            "Things": {
                "Article": [
                    {
                        "name": "Bacon ipsum dolor amet tri-tip hamburger leberkas short ribs chicken turkey sirloin tenderloin shoulder pig bresaola. Pastrami ham hock meatball rump ribeye cupim, capicola venison burgdoggen brisket meatloaf. Turducken t-bone landjaeger pork chop, bresaola pig prosciutto pastrami sausage pancetta capicola short ribs hamburger tail spare ribs. Jerky kevin doner cupim pork belly picanha, pancetta capicola pork loin alcatra corned beef shank. Bacon chislic landjaeger doner corned beef, hamburger beef ribs filet mignon turducken tri-tip andouille pastrami chuck pork loin capicola. Prosciutto shankle chislic, shoulder tri-tip turducken meatball ham pork loin fatback hamburger pork chop bacon pork belly. Kevin sausage salami spare ribs tenderloin t-bone meatball picanha flank jowl pork chop tail turducken tri-tip.",
                        "__meta": [ // <== array because multiple results are possible and/or multiple properties might be indexed
                            {
                                "property": "name",
                                "distanceToQuery": 0.0, // <== distance to the query
                                "start": 26,  // <== just a random example
                                "end": 130    // <== just a random example
                            }
                        ]
                    }
                ]
            }
        }
    },
    "errors": null
}

Where example

{
  Get {
    Things {
      Article(where: {
            path: ["name"],
            operator: Like,
            valueString: "New *"
        }) {
        name
        __meta {
          potentialAnswer(
            distanceToAnswer: 0.5 # <== optional
          ) {
            start
            end
            property
            distanceToQuery
          }
        }
      }
    }
  }
}

Result, where start and end give the starting and ending positions, and property indicates the property in which the answer / most important part can be found.

{
    "data": {
        "Get": {
            "Things": {
                "Article": [
                    {
                        "name": "Bacon ipsum dolor amet tri-tip hamburger leberkas short ribs chicken turkey sirloin tenderloin shoulder pig bresaola. Pastrami ham hock meatball rump ribeye cupim, capicola venison burgdoggen brisket meatloaf. Turducken t-bone landjaeger pork chop, bresaola pig prosciutto pastrami sausage pancetta capicola short ribs hamburger tail spare ribs. Jerky kevin doner cupim pork belly picanha, pancetta capicola pork loin alcatra corned beef shank. Bacon chislic landjaeger doner corned beef, hamburger beef ribs filet mignon turducken tri-tip andouille pastrami chuck pork loin capicola. Prosciutto shankle chislic, shoulder tri-tip turducken meatball ham pork loin fatback hamburger pork chop bacon pork belly. Kevin sausage salami spare ribs tenderloin t-bone meatball picanha flank jowl pork chop tail turducken tri-tip.",
                        "__meta": [ // <== array because multiple results are possible and/or multiple properties might be indexed
                            {
                                "property": "name",
                                "distanceToQuery": 0.0, // <== distance to the query
                                "start": 26,  // <== just a random example
                                "end": 130    // <== just a random example
                            }
                        ]
                    }
                ]
            }
        }
    },
    "errors": null
}

Suggested (first) implementation

  1. Results are returned like the current implementation.
  2. The vectorization of the query is used to find the closest match of a word in a sentence. *
  3. When the closest word is found, the start and endpoint are found at the beginning and end of the sentence.
  4. The distanceToAnswer is the minimal distance; if it is not set, no start and end points will be available. If multiple sentences make the cut, they will all be part of the array.

* There might be potential to also do this on groups of words or complete sentences.
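The steps above could be sketched roughly as follows. This is a toy illustration: vectorize is a hash-based stand-in for the Contextionary vectorizer (it captures no meaning and only makes the sketch runnable), and potential_answer, its parameters, and the sentence-boundary heuristic are hypothetical, not Weaviate APIs.

```python
import math

def vectorize(text):
    # Toy stand-in for the Contextionary vectorizer: hash characters
    # into a fixed-size vector and normalize it to unit length.
    vec = [0.0] * 8
    for i, ch in enumerate(text.lower()):
        vec[i % 8] += ord(ch)
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def distance(a, b):
    # Cosine distance between two unit vectors.
    return 1.0 - sum(x * y for x, y in zip(a, b))

def potential_answer(corpus, query, distance_to_answer=0.5):
    """Steps 2-4: find the word closest to the query vector, then
    expand to the enclosing sentence and return (start, end, distance)."""
    qvec = vectorize(query)
    best_d, best_pos = None, 0
    offset = 0
    for word in corpus.split(" "):
        d = distance(vectorize(word), qvec)
        if best_d is None or d < best_d:
            best_d, best_pos = d, offset
        offset += len(word) + 1
    if best_d > distance_to_answer:
        return None  # step 4: threshold not met
    # Step 3: expand to the sentence containing the closest word.
    start = corpus.rfind(".", 0, best_pos) + 1
    end = corpus.find(".", best_pos)
    end = len(corpus) if end == -1 else end + 1
    return (start, end, best_d)
```

As the later discussion shows, the expensive part of this per-word approach is the repeated vectorization, not the distance comparisons.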

Related

#1136 #1139 #1155 #1156

@bobvanluijt
Member Author

Note: the JSON result misses the potentialAnswer key, but the concept is the same

@etiennedi
Member

Great idea, some thoughts

  1. @michaverhagen and @fefi42: can one of you build a minimal PoC to validate this outside of Weaviate, as part of the use case that inspired this idea? If the results are good, we can think about incorporating this as a feature.
  2. Regarding naming, maybe we could use something more general than "answer", as this feature would also benefit you if you don't specify a question but just a "regular search term".

@fefi42
Contributor

fefi42 commented Jun 5, 2020

I also like the idea. I can build a small prototype next week.

we could use something more general than "answer"

My suggestion would be paragraph or snippet

@etiennedi
Member

etiennedi commented Jun 5, 2020

An alternative that we should also consider and validate this against:

  1. Create two classes, "Paragraph" and "Article", with a "Paragraph.OfArticle" reference and an "Article.hasParagraphs" reference
  2. At import time, split the article up into paragraphs and index them as individual paragraphs with a link to the title
  3. At search time, search through paragraphs rather than articles. Once the best paragraph is found, present both the paragraph and the whole article to the user

Advantages

  • Should be much faster, because the vector positions for each paragraph are created at import time rather than at search time and can therefore be indexed for an efficient search (e.g. HNSW)

Disadvantages

  • Requires additional logic on top of Weaviate
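The import-time splitting could be sketched like this. The split_article helper, the payload shapes, and the beacon format are assumptions for illustration, not the exact Weaviate REST schema.

```python
import uuid

def split_article(title, body):
    """Split an article body into Paragraph objects that reference
    their parent Article via a cross-reference (hypothetical shapes)."""
    article_id = str(uuid.uuid4())
    article = {"class": "Article", "id": article_id, "schema": {"title": title}}
    paragraphs = []
    for text in body.split("\n\n"):
        text = text.strip()
        if not text:
            continue
        paragraphs.append({
            "class": "Paragraph",
            "id": str(uuid.uuid4()),
            "schema": {
                "content": text,
                # hypothetical OfArticle cross-reference back to the parent
                "ofArticle": [{"beacon": f"weaviate://localhost/things/{article_id}"}],
            },
        })
    return article, paragraphs
```

Each paragraph object would then be imported (and vectorized) individually, so the per-paragraph vector positions exist before any query arrives.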

@etiennedi
Member

etiennedi commented Jun 5, 2020

@bobvanluijt regarding the suggested implementation.

Assume the results contain ten 1,000-word articles. If we do this per word, that would be 10,000 vectorizations (minus potential duplicates) plus 10,000 vector comparisons (minus potential duplicates). This might be too expensive for query time (needs evaluation). For an alternative approach that offloads some of the cost to index time, see the previous post.

@fefi42
Contributor

fefi42 commented Jun 5, 2020

Couldn't we do a binary search? This would cut our search from O(n) to O(log n).

@etiennedi
Member

etiennedi commented Jun 5, 2020

Couldn't we do a binary search? This would cut our search from O(n) to O(log n).

  1. A binary search requires a sorted set; an article is typically not a sorted list of words
  2. This would only help us for exact occurrences of a word, not for contextual similarity

What we can (relatively easily*) do is record each word's occurrence position in the inverted index (I think Lucene does the same), but that would again only help for exact matches, not for vector-similarity matches.

* = once we have the custom db strategy
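A positional inverted index of the kind described could be sketched as follows. This is a toy, in-memory illustration (the helper name and the term normalization are assumptions); as noted above, it only helps exact matches, not vector-similarity matches.

```python
from collections import defaultdict

def build_positional_index(corpus):
    """Toy positional inverted index: term -> character offsets of
    each exact occurrence in the corpus."""
    index = defaultdict(list)
    offset = 0
    for word in corpus.split(" "):
        term = word.lower().strip(".,")  # naive normalization
        if term:
            index[term].append(offset)
        offset += len(word) + 1  # account for the separating space
    return index
```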

@fefi42
Contributor

fefi42 commented Jun 5, 2020

I thought about something like this:

  1. split text in half
  2. calculate distance of query to both parts
  3. select half with smaller distance.
  4. continue with 1 until the remaining text is down to paragraph size
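The halving strategy above could be sketched as follows. The function name and parameters are hypothetical; vectorize and distance are passed in because, in practice, vectorization is a network call, which is why this reduces the number of comparisons per level but still vectorizes two halves at each step.

```python
def divide_and_search(corpus, query_vec, vectorize, distance, n=50):
    """Steps 1-4: repeatedly split the text in half and keep the half
    whose vector is closer to the query, until at most n words remain."""
    words = corpus.split()
    while len(words) > n:
        mid = len(words) // 2
        left, right = " ".join(words[:mid]), " ".join(words[mid:])
        # Keep the half closer to the query vector.
        if distance(vectorize(left), query_vec) <= distance(vectorize(right), query_vec):
            words = words[:mid]
        else:
            words = words[mid:]
    return " ".join(words)
```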

@etiennedi
Member

I thought about something like this:

I like the idea, but it definitely requires validation. I'm not sure whether a single sentence influences the vector position enough if you split a 1,000-word article into two 500-word halves. But definitely worth a shot. It would indeed reduce the number of comparisons (but not the number of vectorizations).

@etiennedi etiennedi added this to the _meta in GraphQL Get milestone Jun 5, 2020
fefi42 pushed a commit that referenced this issue Jun 8, 2020
@fefi42
Contributor

fefi42 commented Jun 8, 2020

So far I ran a test for 4 different queries.
Article 1: 98 words
Article 2: 292 words
Article 3: 98 words (same as 1 but different question)
Article 4: 2709 words

Linear search (word by word):
This is by far the slowest and also gives worse results than the other searches.
Example question: "How much did the Manhattan Project cost?"
This question is closest to the word "Manhattan", which appears multiple times in the text.
Therefore the returned word/sentence is not necessarily the one that is actually closest to the question asked.

Speeds:
1: 187.143164ms
2: 547.444578ms
3: 209.306972ms
4: 5.793083927s

Divide and search:
The speed of this depends on the stop criterion.
It divides the corpus until its length is less than n.

n = 50
Speeds:
1: 99.835597ms
2: 161.984499ms
3: 71.573273ms
4: 921.776241ms

n = 100 is competitive with pre-defined snippets.
Speeds:
1: 44.415232ms
2: 110.879809ms
3: 42.971915ms
4: 710.416374ms

Pre-defined snippets:
In this case the corpus was already split into sentences. Then a linear search for the closest sentence is done.

Speeds:
1: 49.154119ms
2: 107.153473ms
3: 44.226528ms
4: 687.757241ms

Conclusion: divide and search can be a valid strategy compared to pre-defined snippets. However, all of the tests break the 100ms mark when using more text.

How should we proceed?
I have not tested a lot of other data sets yet because I want to know if these results are worth going deeper into. Over half a second is a long time for searching a text file, IMO.

@bobvanluijt
Member Author

My suggestion would be paragraph or snippet

@fefi42 I like snippet

So far I ran a test for 4 different queries.

@fefi42 just to make sure, this is only on the objects that the explore function returns and not all objects, right? All are < 1 second isn't it?

@bobvanluijt
Member Author

@bobvanluijt regarding the suggested implementation.

@etiennedi please note that my spike does it per sentence and not per word ;-)

@fefi42
Contributor

fefi42 commented Jun 8, 2020

just to make sure, this is only on the objects that the explore function returns and not all objects, right?

The time is only the search for the text snippet, i.e. the time a function findSnippet(text, query) → (snippet, distance) takes to execute.

All are < 1 second isn't it?

The linear search took 5.7 seconds to execute on the text corpus.

The pre defined snippets test is equal to what bob did before.

@bobvanluijt
Member Author

The pre defined snippets test is equal to what bob did before.

Cool, bob, that's me.

The reason I predefined the snippets is to not do it on a word-for-word basis but based on complete sentences, to significantly speed the process up.

@etiennedi
Member

etiennedi commented Jun 8, 2020

Thanks a lot @fefi42 for the investigation so far.

I am surprised by how slow this is. At the moment we are orders of magnitude away from this being feasible for implementation. We need to get into the 200ms (better: 100ms) range on an average query for this to be viable, otherwise it'll be an operations nightmare under load. We don't have a definition of what an average query is, but a length of 4 seems very low to me. I'd assume "normal" is more in the range of 10-20. So a single article would be allowed to take 20ms max, without incorporating the actual query time yet.

@fefi42 was the result quality between divide&search and linear comparable?

If you have more time, could you try:

  • Splitting by every sentence instead of every word. (As a simplification assume every dot or line break indicates a new sentence) EDIT: I just saw this is what you've already done as "pre-defined snippets"
  • Measure what it is that actually takes so much time. (If it's the subsequent requests to vectorize, then a potential time-saver would be one big request instead of many small ones)

If we cannot get this into a feasible response time, I see these possible alternatives:

  • Do the approach outlined in #1159 (comment), where all the work happens at import time and at search time we just have a normal search
  • Limit this feature to the very first search result. With the idea being "we took the best result and further searched inside it"

@fefi42
Contributor

fefi42 commented Jun 8, 2020

The vectorization costs the most time. For the divide strategy the speed test looks like this:

For 98 words:

init 182ns
split corpus 93ns
vectorize splitted corpus 43.196993ms
distances 7.279µs
split corpus 77ns
vectorize splitted corpus 13.169368ms
distances 2.202µs
split corpus 89ns
vectorize splitted corpus 9.059969ms
distances 3.434µs


69.568684ms



For 292 words:

init 171ns
split corpus 111ns
vectorize splitted corpus 53.721253ms
distances 2.225µs
split corpus 81ns
vectorize splitted corpus 19.029338ms
distances 3.674µs
split corpus 111ns
vectorize splitted corpus 22.840342ms
distances 2.272µs
split corpus 76ns
vectorize splitted corpus 12.428266ms
distances 2.177µs
split corpus 93ns
vectorize splitted corpus 7.802975ms
distances 4.09µs


118.742434ms



For 2709 words:

init 87ns
split corpus 86ns
vectorize splitted corpus 335.728687ms
distances 2.315µs
split corpus 75ns
vectorize splitted corpus 157.303306ms
distances 2.293µs
split corpus 75ns
vectorize splitted corpus 112.149237ms
distances 3.581µs
split corpus 84ns
vectorize splitted corpus 47.402019ms
distances 2.362µs
split corpus 78ns
vectorize splitted corpus 35.664391ms
distances 2.903µs
split corpus 76ns
vectorize splitted corpus 13.888474ms
distances 3.371µs
split corpus 93ns
vectorize splitted corpus 10.601051ms
distances 3.501µs
split corpus 89ns
vectorize splitted corpus 10.730858ms
distances 3.346µs


730.08619ms



Vectorizing on the fly takes almost all of the time.

@etiennedi
Member

etiennedi commented Jun 8, 2020

Thanks.

Vectorizing on the fly takes almost all of the time.

That's a pretty strong argument to do this outside of weaviate and use weaviate's schema as a helper to achieve the same effect as outlined in #1159 (comment)

@bobvanluijt
Member Author

That's a pretty strong argument to do this outside of weaviate and use weaviate's schema as a helper to achieve the same effect as outlined in #1159 (comment)

I do understand this argument, but ideally, it would be part of Weaviate during searching (or vectorizing on import time e.g., if the dataType = text)

@etiennedi
Member

etiennedi commented Jun 8, 2020

The above shows that doing this at search time is not really an option. Doing it at index time would certainly be possible: essentially, instead of storing one vector per item, we would store n vectors per item, where n depends on the split criterion (e.g. word, sentence, paragraph).

Since that's potentially a lot of additional vectors, we should probably make this optional. :-)

This is, however, a bit of a bigger change, as it fundamentally changes how we store things, so I'd still recommend implementing this outside of Weaviate* for the first use case, as I can't give a reliable estimate or ETA on this change.

* = it's not really outside so much as on top of Weaviate

@etiennedi etiennedi changed the title Add suggested location for "answers" in __meta Add _snippets in GraphQL and REST API Jun 12, 2020
@etiennedi etiennedi modified the milestones: _meta in GraphQL Get, Underscore Properties in both APIs Jun 12, 2020
@stale

stale bot commented Aug 11, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the autoclosed Closed by the bot. We still want this, but it didn't quite make the latest prioritization round label Aug 11, 2020
@bobvanluijt
Member Author

Still a feature I would love to see in the not-too-distant future, hence the un-stale message.

@stale stale bot removed the autoclosed Closed by the bot. We still want this, but it didn't quite make the latest prioritization round label Aug 12, 2020
@stale

stale bot commented Oct 11, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the autoclosed Closed by the bot. We still want this, but it didn't quite make the latest prioritization round label Oct 11, 2020
@stale stale bot closed this as completed Oct 18, 2020