
Add _snippets in GraphQL and REST API #1159

Closed
bobvanluijt opened this issue Jun 5, 2020 · 22 comments
Labels
autoclosed Closed by the bot. We still want this, but it didn't quite make the latest prioritization round Contextionary graphql _underscoreProp

Comments

@bobvanluijt
Member

When an object is indexed on a larger text item (e.g., a paragraph like in the news article demo) certain search terms can be found in sentences.

The idea is to add the start and end point of the most important part of the text corpus in the __meta end-point as a potentialAnswer, which can be enabled or disabled by setting a distanceToAnswer. This can work for both explore filters and where filters.

Idea

I was searching for something on Wikipedia under the search term "Is herbalife a pyramid scheme?" and got this response.

Because Google isn't giving the actual answer but a location for the answer, we should be able to calculate something similar.

[Screenshot: Google search result showing the location of the answer (2020-06-05)]

Explore example

{
  Get{
    Things{
      Article(
        explore: {
          concepts: ["I want a spare rib"],
          certainty: 0.7,
          moveAwayFrom: {
            concepts: ["bacon"],
            force: 0.45
          }
        }
      ){
        name
        __meta {
          potentialAnswer(
            distanceToAnswer: 0.5 # <== optional
          ) {
            start
            end
            property
            distanceToQuery
          }
        }
      }
    }
  }
}

Result, where start and end give the starting and ending positions, and property indicates the property in which the answer / most important part can be found.

{
    "data": {
        "Get": {
            "Things": {
                "Article": [
                    {
                        "name": "Bacon ipsum dolor amet tri-tip hamburger leberkas short ribs chicken turkey sirloin tenderloin shoulder pig bresaola. Pastrami ham hock meatball rump ribeye cupim, capicola venison burgdoggen brisket meatloaf. Turducken t-bone landjaeger pork chop, bresaola pig prosciutto pastrami sausage pancetta capicola short ribs hamburger tail spare ribs. Jerky kevin doner cupim pork belly picanha, pancetta capicola pork loin alcatra corned beef shank. Bacon chislic landjaeger doner corned beef, hamburger beef ribs filet mignon turducken tri-tip andouille pastrami chuck pork loin capicola. Prosciutto shankle chislic, shoulder tri-tip turducken meatball ham pork loin fatback hamburger pork chop bacon pork belly. Kevin sausage salami spare ribs tenderloin t-bone meatball picanha flank jowl pork chop tail turducken tri-tip.",
                        "__meta": [ // <== array because multiple results are possible and/or multiple properties might be indexed
                            {
                                "property": "name",
                                "distanceToQuery": 0.0, // <== distance to the query
                                "start": 26,  // <== just a random example
                                "end": 130    // <== just a random example
                            }
                        ]
                    }
                ]
            }
        }
    },
    "errors": null
}

Where example

{
  Get {
    Things {
      Article(where: {
            path: ["name"],
            operator: Like,
            valueString: "New *"
        }) {
        name
        __meta {
          potentialAnswer(
            distanceToAnswer: 0.5 # <== optional
          ) {
            start
            end
            property
            distanceToQuery
          }
        }
      }
    }
  }
}

Result, where start and end give the starting and ending positions, and property indicates the property in which the answer / most important part can be found.

{
    "data": {
        "Get": {
            "Things": {
                "Article": [
                    {
                        "name": "Bacon ipsum dolor amet tri-tip hamburger leberkas short ribs chicken turkey sirloin tenderloin shoulder pig bresaola. Pastrami ham hock meatball rump ribeye cupim, capicola venison burgdoggen brisket meatloaf. Turducken t-bone landjaeger pork chop, bresaola pig prosciutto pastrami sausage pancetta capicola short ribs hamburger tail spare ribs. Jerky kevin doner cupim pork belly picanha, pancetta capicola pork loin alcatra corned beef shank. Bacon chislic landjaeger doner corned beef, hamburger beef ribs filet mignon turducken tri-tip andouille pastrami chuck pork loin capicola. Prosciutto shankle chislic, shoulder tri-tip turducken meatball ham pork loin fatback hamburger pork chop bacon pork belly. Kevin sausage salami spare ribs tenderloin t-bone meatball picanha flank jowl pork chop tail turducken tri-tip.",
                        "__meta": [ // <== array because multiple results are possible and/or multiple properties might be indexed
                            {
                                "property": "name",
                                "distanceToQuery": 0.0, // <== distance to the query
                                "start": 26,  // <== just a random example
                                "end": 130    // <== just a random example
                            }
                        ]
                    }
                ]
            }
        }
    },
    "errors": null
}

Suggested (first) implementation

  1. Results are returned like the current implementation.
  2. The vectorization of the query is used to find the closest match of a word in a sentence. *
  3. When the closest word is found, the start and endpoint are found at the beginning and end of the sentence.
  4. The distanceToAnswer is the minimal distance; if it is not set, no start and end points will be available. If multiple sentences make the cut, they will all be part of the array.

* There might be potential to also do this on groups of words or complete sentences.
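The steps above could be sketched roughly as follows. This is a toy illustration: vectorize is a hash-based stand-in for the Contextionary vectorizer (it captures no meaning and only makes the sketch runnable), and potential_answer, its parameters, and the sentence-boundary heuristic are hypothetical, not Weaviate APIs.

```python
import math

def vectorize(text):
    # Toy stand-in for the Contextionary vectorizer: hash characters
    # into a fixed-size vector and normalize it to unit length.
    vec = [0.0] * 8
    for i, ch in enumerate(text.lower()):
        vec[i % 8] += ord(ch)
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def distance(a, b):
    # Cosine distance between two unit vectors.
    return 1.0 - sum(x * y for x, y in zip(a, b))

def potential_answer(corpus, query, distance_to_answer=0.5):
    """Steps 2-4: find the word closest to the query vector, then
    expand to the enclosing sentence and return (start, end, distance)."""
    qvec = vectorize(query)
    best_d, best_pos = None, 0
    offset = 0
    for word in corpus.split(" "):
        d = distance(vectorize(word), qvec)
        if best_d is None or d < best_d:
            best_d, best_pos = d, offset
        offset += len(word) + 1
    if best_d > distance_to_answer:
        return None  # step 4: threshold not met
    # Step 3: expand to the sentence containing the closest word.
    start = corpus.rfind(".", 0, best_pos) + 1
    end = corpus.find(".", best_pos)
    end = len(corpus) if end == -1 else end + 1
    return (start, end, best_d)
```

As the later discussion shows, the expensive part of this per-word approach is the repeated vectorization, not the distance comparisons.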

Related

#1136 #1139 #1155 #1156

@bobvanluijt
Member Author

Note: the JSON result misses the potentialAnswer key, but the concept is the same

@etiennedi
Member

Great idea, some thoughts

  1. @michaverhagen and @fefi42: can one of you build a minimal PoC to validate this outside of Weaviate, as part of the use case that inspired this idea? If the results are good, we can think about incorporating this as a feature.
  2. Regarding naming, maybe we could use something more general than "answer", as this feature would also benefit you if you don't specify a question but just a "regular search term".

@fefi42
Contributor

fefi42 commented Jun 5, 2020

I also like the idea. I can build a small prototype next week.

we could use something more general than "answer"

My suggestion would be paragraph or snippet

@etiennedi
Member

etiennedi commented Jun 5, 2020

An alternative that we should also consider and validate this against:

  1. Create two classes, "Paragraph" and "Article", with a "Paragraph.OfArticle" reference and an "Article.hasParagraphs" reference
  2. At import time, split the article up into paragraphs and index them as individual paragraphs with a link to the title
  3. At search time, search through paragraphs rather than articles. Once the best paragraph is found, present both the paragraph and the whole article to the user

Advantages

  • Should be much faster, because the vector positions for each paragraph are created at import time rather than at search time and can therefore be indexed for an efficient search (e.g. HNSW)

Disadvantages

  • Requires additional logic on top of Weaviate
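The import-time splitting could be sketched like this. The split_article helper, the payload shapes, and the beacon format are assumptions for illustration, not the exact Weaviate REST schema.

```python
import uuid

def split_article(title, body):
    """Split an article body into Paragraph objects that reference
    their parent Article via a cross-reference (hypothetical shapes)."""
    article_id = str(uuid.uuid4())
    article = {"class": "Article", "id": article_id, "schema": {"title": title}}
    paragraphs = []
    for text in body.split("\n\n"):
        text = text.strip()
        if not text:
            continue
        paragraphs.append({
            "class": "Paragraph",
            "id": str(uuid.uuid4()),
            "schema": {
                "content": text,
                # hypothetical OfArticle cross-reference back to the parent
                "ofArticle": [{"beacon": f"weaviate://localhost/things/{article_id}"}],
            },
        })
    return article, paragraphs
```

Each paragraph object would then be imported (and vectorized) individually, so the per-paragraph vector positions exist before any query arrives.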

@etiennedi
Member

etiennedi commented Jun 5, 2020

@bobvanluijt regarding the suggested implementation.

Assume the results contain ten 1,000-word articles. If we do this per word, that would be 10,000 vectorizations (minus potential duplicates) plus 10,000 vector comparisons (minus potential duplicates). This might be too expensive for query time (needs evaluation). For an alternative approach that offloads some of the cost to index time, see the previous post.

@fefi42
Contributor

fefi42 commented Jun 5, 2020

Couldn't we do a binary search? This would cut our search from O(n) to O(log n).

@etiennedi
Member

etiennedi commented Jun 5, 2020

Couldn't we do a binary search? This would cut our search from O(n) to O(log n).

  1. A binary search requires a sorted set; an article is typically not a sorted list of words
  2. This would only help us for exact occurrences of a word, not for contextual similarity

What we can (relatively easily*) do is record each word's occurrence position in the inverted index (I think Lucene does the same), but that would again only help for exact matches, not for vector-similarity matches.

* = once we have the custom db strategy
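A positional inverted index of the kind described could be sketched as follows. This is a toy, in-memory illustration (the helper name and the term normalization are assumptions); as noted above, it only helps exact matches, not vector-similarity matches.

```python
from collections import defaultdict

def build_positional_index(corpus):
    """Toy positional inverted index: term -> character offsets of
    each exact occurrence in the corpus."""
    index = defaultdict(list)
    offset = 0
    for word in corpus.split(" "):
        term = word.lower().strip(".,")  # naive normalization
        if term:
            index[term].append(offset)
        offset += len(word) + 1  # account for the separating space
    return index
```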

@fefi42
Contributor

fefi42 commented Jun 5, 2020

I thought about something like this:

  1. split text in half
  2. calculate distance of query to both parts
  3. select half with smaller distance.
  4. continue with 1 until the remaining text is down to paragraph size
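The halving strategy above could be sketched as follows. The function name and parameters are hypothetical; vectorize and distance are passed in because, in practice, vectorization is a network call, which is why this reduces the number of comparisons per level but still vectorizes two halves at each step.

```python
def divide_and_search(corpus, query_vec, vectorize, distance, n=50):
    """Steps 1-4: repeatedly split the text in half and keep the half
    whose vector is closer to the query, until at most n words remain."""
    words = corpus.split()
    while len(words) > n:
        mid = len(words) // 2
        left, right = " ".join(words[:mid]), " ".join(words[mid:])
        # Keep the half closer to the query vector.
        if distance(vectorize(left), query_vec) <= distance(vectorize(right), query_vec):
            words = words[:mid]
        else:
            words = words[mid:]
    return " ".join(words)
```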

@etiennedi
Member

I thought about something like this:

I like the idea, but it definitely requires validation. I'm not sure whether a single sentence influences the vector position enough if you split a 1,000-word article into two 500-word halves. But definitely worth a shot. It would indeed reduce the number of comparisons (but not the number of vectorizations).

@etiennedi etiennedi added this to the _meta in GraphQL Get milestone Jun 5, 2020
fefi42 pushed a commit that referenced this issue Jun 8, 2020
@fefi42
Contributor

fefi42 commented Jun 8, 2020

So far I ran a test for 4 different queries.
Article 1: 98 words
Article 2: 292 words
Article 3: 98 words (same as 1 but different question)
Article 4: 2709 words

Linear search (word by word):
This is by far the slowest and also gives worse results than the other searches.
Example question: "How much did the Manhattan Project cost?"
This question is closest to the word "Manhattan", which appears multiple times in the text.
Therefore the returned word/sentence is not necessarily the one that is actually closest to the question asked.

Speeds:
1: 187.143164ms
2: 547.444578ms
3: 209.306972ms
4: 5.793083927s

Divide and search:
The speed of this depends on the stop criterion.
It divides the corpus until its length is less than n.

n = 50
Speeds:
1: 99.835597ms
2: 161.984499ms
3: 71.573273ms
4: 921.776241ms

n = 100 is competitive with pre-defined snippets.
Speeds:
1: 44.415232ms
2: 110.879809ms
3: 42.971915ms
4: 710.416374ms

Pre-defined snippets:
In this case the corpus was already split into sentences. Then a linear search for the closest sentence is done.

Speeds:
1: 49.154119ms
2: 107.153473ms
3: 44.226528ms
4: 687.757241ms

Conclusion: divide and search can be a valid strategy compared to pre-defined snippets. However, all of the tests break the 100ms mark when using more text.

How should we proceed?
I have not tested a lot of other data sets yet because I want to know if these results are worth going deeper into. Over half a second is a long time for searching a text file, IMO.

@bobvanluijt
Member Author

My suggestion would be paragraph or snippet

@fefi42 I like snippet

So far I ran a test for 4 different queries.

@fefi42 just to make sure, this is only on the objects that the explore function returns and not all objects, right? All are < 1 second isn't it?

@bobvanluijt
Member Author

@bobvanluijt regarding the suggested implementation.

@etiennedi please note that my spike does it per sentence and not per word ;-)

@fefi42
Contributor

fefi42 commented Jun 8, 2020

just to make sure, this is only on the objects that the explore function returns and not all objects, right?

The time is only the search for the text snippet, i.e. the time a function findSnippet(text, query) → (snippet, distance) takes to execute.

All are < 1 second isn't it?

The linear search took 5.7 seconds to execute on the text corpus.

The pre defined snippets test is equal to what bob did before.

@bobvanluijt
Member Author

The pre defined snippets test is equal to what bob did before.

Cool, bob, that's me.

The reason I predefined the snippets is to not do it on a word-for-word basis but based on complete sentences, to significantly speed the process up.

@etiennedi
Member

etiennedi commented Jun 8, 2020

Thanks a lot @fefi42 for the investigation so far.

I am surprised by how slow this is. At the moment we are orders of magnitude away from this being feasible for implementation. We need to get into the 200ms (better: 100ms) range on an average query for this to be viable, otherwise it'll be an operations nightmare under load. We don't have a definition of what an average query is, but a length of 4 seems very low to me. I'd assume "normal" is more in the range of 10-20. So a single article would be allowed to take 20ms max, without incorporating the actual query time yet.

@fefi42 was the result quality between divide&search and linear comparable?

If you have more time, could you try:

  • Splitting by every sentence instead of every word. (As a simplification assume every dot or line break indicates a new sentence) EDIT: I just saw this is what you've already done as "pre-defined snippets"
  • Measure what it is that actually takes so much time. (If it's the subsequent requests to vectorize, then a potential time-saver would be one big request instead of many small ones)

If we cannot get this into a feasible response time, I see these possible alternatives:

  • Do the approach outlined in #1159 (comment), where all the work happens at import time and at search time we just have a normal search
  • Limit this feature to the very first search result. With the idea being "we took the best result and further searched inside it"

@fefi42
Contributor

fefi42 commented Jun 8, 2020

The vectorization costs the most time. For the divide strategy the speed test looks like this:

For 98 words:

init 182ns
split corpus 93ns
vectorize splitted corpus 43.196993ms
distances 7.279µs
split corpus 77ns
vectorize splitted corpus 13.169368ms
distances 2.202µs
split corpus 89ns
vectorize splitted corpus 9.059969ms
distances 3.434µs


69.568684ms



For 292 words:

init 171ns
split corpus 111ns
vectorize splitted corpus 53.721253ms
distances 2.225µs
split corpus 81ns
vectorize splitted corpus 19.029338ms
distances 3.674µs
split corpus 111ns
vectorize splitted corpus 22.840342ms
distances 2.272µs
split corpus 76ns
vectorize splitted corpus 12.428266ms
distances 2.177µs
split corpus 93ns
vectorize splitted corpus 7.802975ms
distances 4.09µs


118.742434ms



For 2709 words:

init 87ns
split corpus 86ns
vectorize splitted corpus 335.728687ms
distances 2.315µs
split corpus 75ns
vectorize splitted corpus 157.303306ms
distances 2.293µs
split corpus 75ns
vectorize splitted corpus 112.149237ms
distances 3.581µs
split corpus 84ns
vectorize splitted corpus 47.402019ms
distances 2.362µs
split corpus 78ns
vectorize splitted corpus 35.664391ms
distances 2.903µs
split corpus 76ns
vectorize splitted corpus 13.888474ms
distances 3.371µs
split corpus 93ns
vectorize splitted corpus 10.601051ms
distances 3.501µs
split corpus 89ns
vectorize splitted corpus 10.730858ms
distances 3.346µs


730.08619ms



Vectorizing on the fly takes almost all of the time.

@etiennedi
Member

etiennedi commented Jun 8, 2020

Thanks.

Vectorizing on the fly takes almost all of the time.

That's a pretty strong argument to do this outside of weaviate and use weaviate's schema as a helper to achieve the same effect as outlined in #1159 (comment)

@bobvanluijt
Member Author

That's a pretty strong argument to do this outside of weaviate and use weaviate's schema as a helper to achieve the same effect as outlined in #1159 (comment)

I do understand this argument, but ideally, it would be part of Weaviate during searching (or vectorizing on import time e.g., if the dataType = text)

@etiennedi
Member

etiennedi commented Jun 8, 2020

The above shows that doing this at search time is not really an option. Doing it at index time would certainly be possible: essentially, instead of storing one vector per item, we would store n vectors per item, where n depends on the split criterion (e.g. word, sentence, paragraph).

Since that's potentially a lot of additional vectors, we should probably make this optional. :-)

This is, however, a bit of a bigger change, as it fundamentally changes how we store things, so I'd still recommend implementing this outside of Weaviate* for the first use case, as I can't give a reliable estimate or ETA on this change.

* = it's not really outside so much as on top of Weaviate

@etiennedi etiennedi changed the title Add suggested location for "answers" in __meta Add _snippets in GraphQL and REST API Jun 12, 2020
@etiennedi etiennedi modified the milestones: _meta in GraphQL Get, Underscore Properties in both APIs Jun 12, 2020
@stale

stale bot commented Aug 11, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the autoclosed Closed by the bot. We still want this, but it didn't quite make the latest prioritization round label Aug 11, 2020
@bobvanluijt
Member Author

Still a feature I would love to see in the not-too-distant future, hence the un-stale message.

@stale stale bot removed the autoclosed Closed by the bot. We still want this, but it didn't quite make the latest prioritization round label Aug 12, 2020
@stale

stale bot commented Oct 11, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the autoclosed Closed by the bot. We still want this, but it didn't quite make the latest prioritization round label Oct 11, 2020
@stale stale bot closed this as completed Oct 18, 2020