The Snippet::fragments member is misleading and needs a rename #916

liamwarfield · 2020-10-25T01:17:49Z

Is your feature request related to a problem? Please describe.
I am creating an app where I would like to break different fragment onto their own lines.

What my app currently outputs:

[ueuccfcgbx] Fucshia Arch: https://blog.quarkslab.com/playing-around-with-the-fuchsia-operating-system.html
address space is context-switched by **Zircon**.Contrary to other OSes however, the IOMMU
(Input-Output MMU), plays an important role on **Zircon**: it is

Notice how **Zircon**.Contrary is concatenated.

What I would like:

[ueuccfcgbx] Fucshia Arch: https://blog.quarkslab.com/playing-around-with-the-fuchsia-operating-system.html
address space is context-switched by **Zircon**. 
Contrary to other OSes however, the IOMMU(Input-Output MMU), plays an important role on **Zircon**: it is

Currently a Snippet stores fragments (a string that has all of the text of the snippet) and a vector of HighlightSections (what parts of fragments should be highlighted). I would like to break each fragment onto its own line to make this output more readable/understandable. Currently there is no way to know where one fragment ends, and another begins.

Describe the solution you'd like
Change the type of Snippet::fragments to Vec<String> and add a fragment_number member to HighlightSection.

[Optional] describe alternatives you've considered
Add a new member to Snippet similar to highlighted called fragment_sections, that is a Vec of start and stop points for the different fragments

The text was updated successfully, but these errors were encountered:

fulmicoton · 2020-10-25T01:30:16Z

I agree. Are you volunteering to pick this ticket?

liamwarfield · 2020-10-25T02:06:23Z

I'm up to picking this up! After looking at tantivy/src/snippet/mod.rs, it looks SnippetGenerator selects the fragment with the best score and only uses that one fragment. It looks like there are two options here:

Change Snippet to hold multiple fragments.
1. Change this line to return up to n fragments.
2. Change Snippet and HighlightedSection appropriately.
Change Snippet::fragments to Snippet::fragment (Currently the name is misleading) and add a function to SnippetGenerator snippets_from_doc(&self, doc: &Document, count: usize) -> Vec<Snippet> ordered by score.

I think the second change would be better because it would not change the Snippet struct (other than naming) and makes more sense from a users prospective. The first change could better described as a converting the Snippet type into a Snippets type. Both changes are breaking, do you prefer either one?

fulmicoton · 2020-10-27T00:56:40Z

I prefer change 1.
If you are up to it, there is an important improvement that you can add here.

Having a query term appears several times in the snippet is not really a win. Right now, the same term appearing twice in a fragment will result in twice the score of a fragment containing it only once.

Using that kind of score with several fragments is trickier, but using a greedy algorithm gives an excellent approximation of the best answer.

So

pick the fragment with the best score.
find the fragment that completes it the best. (ie. terms appearing in the first fragment are not used in the computation of the score of the second fragment. Or they contribute but to a lesser extent)

liamwarfield · 2020-10-29T20:59:04Z

I think one way we could make the scoring better would be giving the first fragment contain the first mention of a search term in a document a higher score. Usually the first time a term appears in most document is an introduction to that term. Does the token stream struct in search_fragments iterates through the document from beginning to end?

liamwarfield · 2020-10-29T21:30:39Z

I also got through a rough implemention for option 2 done before I read you comment.

fulmicoton · 2020-11-02T03:06:35Z

I think one way we could make the scoring better would be giving the first fragment contain the first mention of a search term in a document a higher score. Usually the first time a term appears in most document is an introduction to that term.

This makes sense

Does the token stream struct in search_fragments iterates through the document from beginning to end?

Yes. You also have a position attribute in the Token emitted by the TokenStream.

fulmicoton · 2020-11-02T03:07:52Z

I also got through a rough implemention for option 2 done before I read you comment.

Do you want to send a PR? Let me send you an invite to tantivy-dev. You can push your branch as a feature branch if you wnat.

liamwarfield · 2020-11-04T19:38:25Z

I've added testing and updated the snippet example my branch. What branch should I make the PR to?

bitzl · 2021-12-29T13:16:03Z

Thank you for your work, that's right what I was looking for 🤓

Any news on this or anything I can do to help?

liamwarfield · 2021-12-29T20:49:06Z

MB, Completely forgot about this thread. I got blocked last time because I did not know the guidelines for submitting a breaking change for review. @bitzl if you could explain (or point me towards a resource) how I should merge in a change that would be great! 🙂

liamwarfield · 2021-12-30T19:55:50Z

Just finished up testing with the newest tagged release. it looks like the original problem I opened this thread with no longer exists. snippets_from_doc has been changed to snippet_from_doc, and results from snippet_from_doc only contain a single fragment of the origin text. The only thing that's still feels a little janky here is the fragments member of the Snippet struct. IMHO it should be renamed to fragment, because the code here only returns 1 fragment, and the doc string also describes it as a single fragment from the original document.

https://github.com/quickwit-inc/tantivy/blob/7234bef0eb84f0b8167827dd71caac6a16df28c6/src/snippet/mod.rs#L52-L58

I think there are two things here:

Snippet.fragments -> Snippet.fragment
"Having a query term appears several times in the snippet is not really a win. Right now, the same term appearing twice in a fragment will result in twice the score of a fragment containing it only once." This feels like a separate issue. I'll try playing around with the scoring function in my own projects to see what things feel better. I'll try some stuff on my own projects:

giving higher scores to the first fragment in a document with a search term
containing different terms should count more than multiple uses of a single term.

liamwarfield · 2021-12-30T20:28:37Z

Nvm The scoring is now based off of how many documents a term shows up in:
https://github.com/quickwit-inc/tantivy/blob/e5e252cbc020550fecc06bce08566b6e42598464/src/snippet/mod.rs#L246-L250

liamwarfield · 2021-12-30T20:31:43Z

No wait, it was the same way Oct 2020:
https://github.com/quickwit-inc/tantivy/blob/c23a03ad8191ca928d96b668f8800a4ab22b2e13/src/snippet/mod.rs#L271-L275

bitzl · 2021-12-30T21:58:36Z

@liamwarfield That looks great, I'm excited to use it once it is merged.

Unfortunately I am not a maintainer, just another excited user. So when I asked what I could do to help I meant by contributing, not by assisting how to contribute (I don't know much about that, tbh).

liamwarfield · 2022-01-03T19:26:30Z

we've been merged in, I'll close this issue and open up an new one on snippet scoring!

fulmicoton added the enhancement label Oct 25, 2020

liamwarfield changed the title ~~Snippet fragments should be Vec<String> not String~~ The Snippet::fragments member is misleading and needs a rename Nov 4, 2020

liamwarfield self-assigned this Nov 4, 2020

liamwarfield mentioned this issue Dec 31, 2021

Change Snippet.fragments -> Snippet.fragment #1243

Merged

liamwarfield closed this as completed Jan 3, 2022

liamwarfield mentioned this issue Jan 3, 2022

Better Snippet Scoring #1249

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

The Snippet::fragments member is misleading and needs a rename #916

The Snippet::fragments member is misleading and needs a rename #916

liamwarfield commented Oct 25, 2020

fulmicoton commented Oct 25, 2020

liamwarfield commented Oct 25, 2020

fulmicoton commented Oct 27, 2020

liamwarfield commented Oct 29, 2020

liamwarfield commented Oct 29, 2020

fulmicoton commented Nov 2, 2020

fulmicoton commented Nov 2, 2020

liamwarfield commented Nov 4, 2020

bitzl commented Dec 29, 2021

liamwarfield commented Dec 29, 2021

liamwarfield commented Dec 30, 2021

liamwarfield commented Dec 30, 2021

liamwarfield commented Dec 30, 2021

bitzl commented Dec 30, 2021

liamwarfield commented Jan 3, 2022

The Snippet::fragments member is misleading and needs a rename #916

The Snippet::fragments member is misleading and needs a rename #916

Comments

liamwarfield commented Oct 25, 2020

fulmicoton commented Oct 25, 2020

liamwarfield commented Oct 25, 2020

fulmicoton commented Oct 27, 2020

liamwarfield commented Oct 29, 2020

liamwarfield commented Oct 29, 2020

fulmicoton commented Nov 2, 2020

fulmicoton commented Nov 2, 2020

liamwarfield commented Nov 4, 2020

bitzl commented Dec 29, 2021

liamwarfield commented Dec 29, 2021

liamwarfield commented Dec 30, 2021

liamwarfield commented Dec 30, 2021

liamwarfield commented Dec 30, 2021

bitzl commented Dec 30, 2021

liamwarfield commented Jan 3, 2022