The Snippet::fragments member is misleading and needs a rename #916
Comments
I agree. Are you volunteering to pick this ticket? |
I'm up for picking this up! After looking at tantivy/src/snippet/mod.rs, it looks
I think the second change would be better because it would not change the |
I prefer change 1. Having a query term appear several times in the snippet is not really a win. Right now, the same term appearing twice in a fragment results in twice the score of a fragment containing it only once. Using that kind of score with several fragments is trickier, but a greedy algorithm gives an excellent approximation of the best answer. So
|
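A minimal sketch of what a greedy selection over several fragments could look like. This is one possible reading of the comment above, not tantivy's implementation: `Candidate`, `greedy_select`, and the per-term weights are hypothetical names introduced for illustration. Each round picks the fragment that adds the most weight from query terms not yet covered, so a term repeated inside or across fragments is counted only once.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical candidate fragment: a byte range plus the query terms found in it.
#[derive(Debug, Clone)]
pub struct Candidate {
    pub start: usize,
    pub stop: usize,
    pub terms: Vec<String>,
}

/// Greedily pick up to `max_fragments` candidates. Each round takes the
/// candidate whose not-yet-covered terms carry the most weight.
/// Returns the chosen indices, ordered by position in the document.
pub fn greedy_select(
    candidates: &[Candidate],
    term_weights: &HashMap<String, f32>,
    max_fragments: usize,
) -> Vec<usize> {
    let mut covered: HashSet<&str> = HashSet::new();
    let mut chosen: Vec<usize> = Vec::new();

    while chosen.len() < max_fragments {
        let mut best: Option<(usize, f32)> = None;
        for (idx, cand) in candidates.iter().enumerate() {
            if chosen.contains(&idx) {
                continue;
            }
            // Deduplicate terms so repeats inside a fragment do not inflate the gain.
            let new_terms: HashSet<&str> = cand
                .terms
                .iter()
                .map(String::as_str)
                .filter(|t| !covered.contains(t))
                .collect();
            let gain: f32 = new_terms
                .iter()
                .map(|t| term_weights.get(*t).copied().unwrap_or(0.0))
                .sum();
            if gain > 0.0 && best.map_or(true, |(_, g)| gain > g) {
                best = Some((idx, gain));
            }
        }
        match best {
            Some((idx, _)) => {
                covered.extend(candidates[idx].terms.iter().map(String::as_str));
                chosen.push(idx);
            }
            // No remaining fragment adds a new term, so stop early.
            None => break,
        }
    }

    chosen.sort_by_key(|&i| candidates[i].start);
    chosen
}
```

The greedy choice is not guaranteed to be optimal, but it avoids enumerating every combination of fragments.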
I think one way we could make the scoring better would be to give the fragment that contains the first mention of a search term in a document a higher score. Usually the first time a term appears in a document is an introduction to that term. Does the token stream struct in search_fragments iterate through the document from beginning to end? |
I also got a rough implementation of option 2 done before I read your comment. |
This makes sense
Yes. You also have a position attribute in the Token emitted by the TokenStream. |
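A rough sketch of the first-mention idea, assuming tokens arrive in document order as confirmed above. `Tok` is a simplified stand-in and not tantivy's actual `Token` type; `score_with_first_mention` and `first_bonus` are hypothetical names used only for illustration.

```rust
use std::collections::{HashMap, HashSet};

/// Simplified stand-in for a tokenizer token (not tantivy's actual type).
pub struct Tok {
    pub position: usize,
    pub text: String,
}

/// Score every token that matches a query term. The first occurrence of
/// each term in the document gets `first_bonus` added to its weight.
/// Returns `(position, score)` pairs in document order.
pub fn score_with_first_mention(
    tokens: &[Tok],
    term_weights: &HashMap<String, f32>,
    first_bonus: f32,
) -> Vec<(usize, f32)> {
    let mut seen: HashSet<&str> = HashSet::new();
    let mut scored = Vec::new();
    for tok in tokens {
        if let Some(&weight) = term_weights.get(tok.text.as_str()) {
            // `insert` returns true only the first time a term is seen.
            let bonus = if seen.insert(tok.text.as_str()) {
                first_bonus
            } else {
                0.0
            };
            scored.push((tok.position, weight + bonus));
        }
    }
    scored
}
```

A fragment's score could then be the sum of the scores of the tokens that fall inside it.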
Do you want to send a PR? Let me send you an invite to tantivy-dev. You can push your branch as a feature branch if you want. |
I've added testing and updated the snippet example on my branch. What branch should I make the PR to? |
Thank you for your work, that's exactly what I was looking for 🤓 Any news on this or anything I can do to help? |
My bad, I completely forgot about this thread. I got blocked last time because I did not know the guidelines for submitting a breaking change for review. @bitzl if you could explain (or point me towards a resource) how I should merge in a change, that would be great! 🙂 |
Just finished up testing with the newest tagged release. It looks like the original problem I opened this thread with no longer exists. I think there are two things here:
|
Nvm, the scoring is now based on how many documents a term shows up in: |
No wait, it was the same way Oct 2020: |
@liamwarfield That looks great, I'm excited to use it once it is merged. Unfortunately I am not a maintainer, just another excited user. So when I asked what I could do to help I meant by contributing, not by assisting how to contribute (I don't know much about that, tbh). |
We've been merged in. I'll close this issue and open a new one on snippet scoring! |
Is your feature request related to a problem? Please describe.
I am creating an app where I would like to break different fragments onto their own lines.
What my app currently outputs:
Notice how **Zircon**.Contrary is concatenated.
What I would like:
Currently a Snippet stores fragments (a string that has all of the text of the snippet) and a vector of HighlightSections (what parts of fragments should be highlighted). I would like to break each fragment onto its own line to make this output more readable/understandable. Currently there is no way to know where one fragment ends and another begins.
Describe the solution you'd like
Change the type of Snippet::fragments to Vec<String> and add a fragment_number member to HighlightSection.
[Optional] describe alternatives you've considered
Add a new member to Snippet, similar to highlighted, called fragment_sections, that is a Vec of start and stop points for the different fragments.
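The two proposals can be sketched as plain data types. This is only an illustration under stated assumptions: `SnippetV1`, `HighlightSectionV1`, and `SnippetV2` are hypothetical names and do not mirror tantivy's real definitions, and all offsets are assumed to be byte offsets that fall on character boundaries.

```rust
use std::ops::Range;

/// Proposed solution (hypothetical types): one String per fragment, and each
/// highlight records which fragment it belongs to.
pub struct HighlightSectionV1 {
    pub fragment_number: usize,
    pub start: usize,
    pub stop: usize,
}

pub struct SnippetV1 {
    pub fragments: Vec<String>,
    pub highlighted: Vec<HighlightSectionV1>,
}

/// Alternative (hypothetical type): keep the single concatenated string and
/// add the start and stop points of each fragment next to the highlights.
pub struct SnippetV2 {
    pub fragments: String,
    pub highlighted: Vec<Range<usize>>,
    pub fragment_sections: Vec<Range<usize>>,
}

impl SnippetV2 {
    /// Return each fragment as its own string slice, in document order.
    /// Panics if a range is out of bounds or not on a char boundary.
    pub fn fragment_lines(&self) -> Vec<&str> {
        self.fragment_sections
            .iter()
            .map(|range| &self.fragments[range.clone()])
            .collect()
    }
}
```

With the alternative, the existing fragments string and highlight offsets stay as they are, and a caller can join the result of `fragment_lines()` with newlines to print one fragment per line.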