Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The Snippet::fragments member is misleading and needs a rename #916

Closed
liamwarfield opened this issue Oct 25, 2020 · 15 comments
Closed

The Snippet::fragments member is misleading and needs a rename #916

liamwarfield opened this issue Oct 25, 2020 · 15 comments
Assignees

Comments

@liamwarfield
Copy link
Contributor

Is your feature request related to a problem? Please describe.
I am creating an app where I would like to break different fragment onto their own lines.

What my app currently outputs:

[ueuccfcgbx] Fucshia Arch: https://blog.quarkslab.com/playing-around-with-the-fuchsia-operating-system.html
address space is context-switched by **Zircon**.Contrary to other OSes however, the IOMMU
(Input-Output MMU), plays an important role on **Zircon**: it is

Notice how **Zircon**.Contrary is concatenated.

What I would like:

[ueuccfcgbx] Fucshia Arch: https://blog.quarkslab.com/playing-around-with-the-fuchsia-operating-system.html
address space is context-switched by **Zircon**. 
Contrary to other OSes however, the IOMMU(Input-Output MMU), plays an important role on **Zircon**: it is

Currently a Snippet stores fragments (a string that has all of the text of the snippet) and a vector of HighlightSections (what parts of fragments should be highlighted). I would like to break each fragment onto its own line to make this output more readable/understandable. Currently there is no way to know where one fragment ends, and another begins.

Describe the solution you'd like
Change the type of Snippet::fragments to Vec<String> and add a fragment_number member to HighlightSection.

[Optional] describe alternatives you've considered
Add a new member to Snippet similar to highlighted called fragment_sections, that is a Vec of start and stop points for the different fragments

@fulmicoton
Copy link
Collaborator

I agree. Are you volunteering to pick this ticket?

@liamwarfield
Copy link
Contributor Author

I'm up to picking this up! After looking at tantivy/src/snippet/mod.rs, it looks SnippetGenerator selects the fragment with the best score and only uses that one fragment. It looks like there are two options here:

  1. Change Snippet to hold multiple fragments.

    1. Change this line to return up to n fragments.
    2. Change Snippet and HighlightedSection appropriately.
  2. Change Snippet::fragments to Snippet::fragment (Currently the name is misleading) and add a function to SnippetGenerator snippets_from_doc(&self, doc: &Document, count: usize) -> Vec<Snippet> ordered by score.

I think the second change would be better because it would not change the Snippet struct (other than naming) and makes more sense from a users prospective. The first change could better described as a converting the Snippet type into a Snippets type. Both changes are breaking, do you prefer either one?

@fulmicoton
Copy link
Collaborator

I prefer change 1.
If you are up to it, there is an important improvement that you can add here.

Having a query term appears several times in the snippet is not really a win. Right now, the same term appearing twice in a fragment will result in twice the score of a fragment containing it only once.

Using that kind of score with several fragments is trickier, but using a greedy algorithm gives an excellent approximation of the best answer.

So

  • pick the fragment with the best score.
  • find the fragment that completes it the best. (ie. terms appearing in the first fragment are not used in the computation of the score of the second fragment. Or they contribute but to a lesser extent)

@liamwarfield
Copy link
Contributor Author

I think one way we could make the scoring better would be giving the first fragment contain the first mention of a search term in a document a higher score. Usually the first time a term appears in most document is an introduction to that term. Does the token stream struct in search_fragments iterates through the document from beginning to end?

@liamwarfield
Copy link
Contributor Author

I also got through a rough implemention for option 2 done before I read you comment.

@fulmicoton
Copy link
Collaborator

I think one way we could make the scoring better would be giving the first fragment contain the first mention of a search term in a document a higher score. Usually the first time a term appears in most document is an introduction to that term.

This makes sense

Does the token stream struct in search_fragments iterates through the document from beginning to end?

Yes. You also have a position attribute in the Token emitted by the TokenStream.

@fulmicoton
Copy link
Collaborator

I also got through a rough implemention for option 2 done before I read you comment.

Do you want to send a PR? Let me send you an invite to tantivy-dev. You can push your branch as a feature branch if you wnat.

@liamwarfield
Copy link
Contributor Author

I've added testing and updated the snippet example my branch. What branch should I make the PR to?

@liamwarfield liamwarfield changed the title Snippet fragments should be Vec<String> not String The Snippet::fragments member is misleading and needs a rename Nov 4, 2020
@liamwarfield liamwarfield self-assigned this Nov 4, 2020
@bitzl
Copy link

bitzl commented Dec 29, 2021

Thank you for your work, that's right what I was looking for 🤓

Any news on this or anything I can do to help?

@liamwarfield
Copy link
Contributor Author

MB, Completely forgot about this thread. I got blocked last time because I did not know the guidelines for submitting a breaking change for review. @bitzl if you could explain (or point me towards a resource) how I should merge in a change that would be great! 🙂

@liamwarfield
Copy link
Contributor Author

Just finished up testing with the newest tagged release. it looks like the original problem I opened this thread with no longer exists. snippets_from_doc has been changed to snippet_from_doc, and results from snippet_from_doc only contain a single fragment of the origin text. The only thing that's still feels a little janky here is the fragments member of the Snippet struct. IMHO it should be renamed to fragment, because the code here only returns 1 fragment, and the doc string also describes it as a single fragment from the original document.

https://github.com/quickwit-inc/tantivy/blob/7234bef0eb84f0b8167827dd71caac6a16df28c6/src/snippet/mod.rs#L52-L58

I think there are two things here:

  1. Snippet.fragments -> Snippet.fragment
  2. "Having a query term appears several times in the snippet is not really a win. Right now, the same term appearing twice in a fragment will result in twice the score of a fragment containing it only once." This feels like a separate issue. I'll try playing around with the scoring function in my own projects to see what things feel better. I'll try some stuff on my own projects:
  • giving higher scores to the first fragment in a document with a search term
  • containing different terms should count more than multiple uses of a single term.

@liamwarfield
Copy link
Contributor Author

Nvm The scoring is now based off of how many documents a term shows up in:
https://github.com/quickwit-inc/tantivy/blob/e5e252cbc020550fecc06bce08566b6e42598464/src/snippet/mod.rs#L246-L250

@liamwarfield
Copy link
Contributor Author

@bitzl
Copy link

bitzl commented Dec 30, 2021

@liamwarfield That looks great, I'm excited to use it once it is merged.

Unfortunately I am not a maintainer, just another excited user. So when I asked what I could do to help I meant by contributing, not by assisting how to contribute (I don't know much about that, tbh).

@liamwarfield
Copy link
Contributor Author

we've been merged in, I'll close this issue and open up an new one on snippet scoring!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants