Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

first pass at browse offset and match highlighting #255

Closed

Conversation

hackartisan
Copy link
Contributor

Note there are TODOs (most notably the config file piece).

Seeking input on:

  • how to deal with the method signature of alphabeticBrowse in VuFindSearch/Backend/Solr/Backend
  • css styling (currently a blue box but if nothing else the color should change)
  • is there a better way to figure out whether the result solr returns is an exact match? I'm actually fine with the rough comparison (uppercase, strip spaces) but if there's any easy and more precise way to tell I'd use that.

@demiankatz
Copy link
Member

Just a few comments for now:

1.) I'll second Tod's comment (made in an email off the PR) that the $offset variable/parameter in the PHP should probably be renamed to avoid confusion with the offset parameter of the browse handler. $rows_before works for me if you have no better name in mind.

2.) It probably makes sense to hold off on merging this until we can include the highlight matching inside the browse handler. I'm definitely not the best person to tackle that, but if nobody else has time, I might be able to look at it eventually -- would be a good excuse to learn the code a little better.

3.) I notice that you're implementing this in the old bootstrap theme. I'm about to delete that from master (either today or tomorrow). I'll most likely hold off on doing that until I hear more from you, but it would be helpful if you could revise the PR to use bootstrap3 or blueprint instead.

@hackartisan
Copy link
Contributor Author

re 1) no problem
re 2) I'm hoping Mark can make quick work of it but if needed I may be able to take a look.
re 3) Yeah we still have a way to go before we're off bootstrap and on bootstrap3. I can revise the PR, though.

@demiankatz
Copy link
Member

re 1) great! Please leave a comment when a revised version is ready for review.
re 2) Sounds good; we'll see what Mark has to say.
re 3) The change would be helpful to ensure that the PR remains mergeable; let us know if we can do anything more to help ease your transition to bootstrap3 -- I don't want to pull the rug out from under you, but I do hope to complete the deprecation of that theme soon, since the lack of bootstrap templates is holding up a bunch of other PR's, and they can all move forward if we take the bootstrap theme off the table.

@hackartisan
Copy link
Contributor Author

Demian, don't let us hold you up on the theme deprecation. We'll catch up before too long and I'll probably deploy this stuff locally before it's accepted to master, anyway. You've given ample warning and we'll definitely reach out if we need to.

@demiankatz
Copy link
Member

Excellent, thanks!

@marktriggs
Copy link
Contributor

On 2), I'll give the highlighting a shot and put up a PR for that. So in the results I'll try to indicate:

  • The row in the results that was the match
  • Whether it was an exact match or, otherwise, how many characters of the search string were matched

@hackartisan
Copy link
Contributor Author

Thanks, Mark! I think it would be easier to use a simple bool for exact head-of-string match. Because the number of characters might differ between the normalized and non-normalized form of the search.

@marktriggs
Copy link
Contributor

Ah yep, that makes sense. And thinking out loud on how that would work, does this sort of approach make sense?

 user_input = <what the user typed>
 matched_heading = <first browse heading matched>

 user_input_key = generate_sort_key(user_input)

 if (generate_sort_key(matched_heading) == user_input_key) {
   # exact match on user input
 else {
   # was there a head of string match?
   for (i = 1; i < matched_heading.length; i++) {
     head_of_string = substring(matched_heading, 0, i)
     if (generate_sort_key(head_of_string) == user_input_key) {
        # Got a head of string match
        return true
     }
   }
 }

So I guess my thought is to generate successively larger substrings of the matched header, and checking whether we get a match against the user's input (using the normalized sort key for the comparison). Does that seem like it would work?

@demiankatz
Copy link
Member

That approach makes sense to me. One other question for the crowd: obviously, there will be scenarios where the user input matches the head of multiple strings. How do we want to handle that?

@hackartisan
Copy link
Contributor Author

Oh! I think our interpretations of 'exact head-of-string match' are slightly different. I was thinking of something like the following example:

user input: PS153
matched heading: PS153.A73H37 2011
exact head-of-string match: true

So more like this

user_input_key = generate_sort_key(<what the user typed>)
matched_heading_key = generate_sort_key(<first browse heading matched>)

if (matched_heading_key == user_input_key) {
   # exact match on user input
else if user_input_key is a left-anchored substring of matched_heading_key {
   # head of string match
 }

@hackartisan
Copy link
Contributor Author

Demian, I think right now we're only concerned with the status of the match on the 'first match' row (that is, the first row returned once the offset is compensated for.

@hackartisan
Copy link
Contributor Author

To further my example, I'm not interested at all in whether some substring of the user input is a head-of-string match for the heading

user input: PS151
matched heading: PS153.A73H37 2011
exact head-of-string match: false (i.e. the fact that the first 4 characters match is useless to me)

@demiankatz
Copy link
Member

I think Mark's example is actually matching your interpretation of what should be happening. The reason it's going on in a loop is because we have to test every possible substring of the matched heading against the user input heading, since we don't know if there's a 1:1 relationship between input string length and key string length.

@hackartisan
Copy link
Contributor Author

Oh okay! cool! I didn't realize there wasn't a substring test in Java.

@demiankatz
Copy link
Member

It's not that there isn't a substring test in Java; it's just that the relationship between sort keys and headings is flexible enough that you can't make any assumptions about what keys look like. Suppose, for (contrived) example, that a key is generated as an integer value calculated from the text of the string. You can only test for substring matches by generating keys for all possible substrings and comparing them. You can't do substring checking against the keys themselves, because they don't work the same way as "regular strings." (Not that any of this is particularly important... just trying to shed light on the subject).

@hackartisan
Copy link
Contributor Author

I see, yes, because the normalizers return byte arrays. Thanks for expanding on the explanation. Exactly the sort of thing that makes it much faster for Mark to implement than for me to do it!

@marktriggs
Copy link
Contributor

Cool! Thanks all. I'll see how I go :)

@marktriggs
Copy link
Contributor

Hi all,

I've made a start on this. While testing I noticed an interesting case that I thought I'd get the group's feedback on.

I've been testing on title browse, which builds sort keys based on title_sort and displays headings from the title_fullStr field. One of the documents in my test data looks like this:

title_fullStr: The nature of love
title_sort: nature of love

So, the biblio indexing process has dropped the word "the" to produce title_sort. When the indexer runs, it turns title_sort into a normalized byte array (for the sort key), and stores "The nature of love" in the SQLite index file.

This makes the behaviour at search time a little confusing. If I browse for "the nature of love", I actually don't get the above match, because my query contains the word "the" while the sort key does not. I have to browse for "nature of love" to actually get this match. I guess this has always been the case, and maybe it isn't a big problem.

But, when I browse for "nature of love", my new code tries to work out whether this is an exact or start-of-string match against the display heading, not the sort heading. So it looks for a common prefix between "the nature of love" (the display heading) and "nature of love" (my query), and finds none.

So I'm a bit unsure what to do for these cases. I could modify the browse indexes to store the original sort key text and use that for my prefix checking, but if the sort key is quite different to the display heading, knowing that there was a prefix match on the sort key might not help you much in highlighting the match on the display heading. That still seems better than nothing, but what do others think?

@todolson
Copy link

Compare to the sort key, not the display. The sort key is what controls the ordering and where the search term lands in the index, and includes all normalization. Normalize the search term the way you would at indexing time, no need to second-guess.

If a site wants to allow matches titles both with and without an initial article, we can arrange it by adding both versions to the browse index. That’s how the UChicago side is configured, so it can be done, and with little fuss. So definitely take the term as-is, normalize as usual, and work from that.

-Tod

On Dec 10, 2014, at 9:08 PM, Mark Triggs <notifications@github.commailto:notifications@github.com> wrote:

Hi all,

I've made a start on this. While testing I noticed an interesting case that I thought I'd get the group's feedback on.

I've been testing on title browse, which builds sort keys based on title_sort and displays headings from the title_fullStr field. One of the documents in my test data looks like this:

title_fullStr: The nature of love
title_sort: nature of love

So, the biblio indexing process has dropped the word "the" to produce title_sort. When the indexer runs, it turns title_sort into a normalized byte array (for the sort key), and stores "The nature of love" in the SQLite index file.

This makes the behaviour at search time a little confusing. If I browse for "the nature of love", I actually don't get the above match, because my query contains the word "the" while the sort key does not. I have to browse for "nature of love" to actually get this match. I guess this has always been the case, and maybe it isn't a big problem.

But, when I browse for "nature of love", my new code tries to work out whether this is an exact or start-of-string match against the display heading, not the sort heading. So it looks for a common prefix between "the nature of love" (the display heading) and "nature of love" (my query), and finds none.

So I'm a bit unsure what to do for these cases. I could modify the browse indexes to store the original sort key text and use that for my prefix checking, but if the sort key is quite different to the display heading, knowing that there was a prefix match on the sort key might not help you much in highlighting the match on the display heading. That still seems better than nothing, but what do others think?


Reply to this email directly or view it on GitHubhttps://github.com//pull/255#issuecomment-66562950.

@marktriggs
Copy link
Contributor

Thanks Tod, that matches where I was heading too. I've put up an initial version showing the changes required to make this work:

vufind-org/vufind-browse-handler@203f125

The changes affect the indexing process too, so people would need to rebuild their browse indexes for this new version.

@demiankatz
Copy link
Member

Looks good to me (at first glance, anyway -- I don't claim to have performed a comprehensive code review). I can certainly live with the need to rebuild the indexes -- just something to note in the changelog.

@todolson
Copy link

I’ve only managed a superficial look, which seems fine so far. I’ve not tried to merge it with my other changes.

On Dec 11, 2014, at 11:28 AM, Demian Katz <notifications@github.commailto:notifications@github.com> wrote:

Looks good to me (at first glance, anyway -- I don't claim to have performed a comprehensive code review). I can certainly live with the need to rebuild the indexes -- just something to note in the changelog.


Reply to this email directly or view it on GitHubhttps://github.com//pull/255#issuecomment-66655617.

@hackartisan
Copy link
Contributor Author

Mark, just wanted to say Thank You for this - I read it and I'm working on integrating it but I'm not sure whether I'll have any feedback before the holiday (next week and the week after). Will keep you posted.

@hackartisan
Copy link
Contributor Author

Hey Demian, I migrated the theme changes to bootstrap3. The highlight box isn't working when it tries to go around the very first entry (blank search); not sure why. But I'm going to move on to testing Mark's browse handler changes and de-hack-ifying my highlight logic. Leaving bootstrap files as-is; I assume you'll just skip them.

@marktriggs
Copy link
Contributor

Thanks Anna! No rush from me. I'll be pretty busy over the next few weeks too, so maybe we can pick up again once you've had a chance to try it out.

@hackartisan
Copy link
Contributor Author

Hey Demian! Mark and I are done going back and forth over the highlighting support and we're ready for that branch (https://github.com/vufind-org/vufind-browse-handler/tree/highlighting-support) to be merged into master and pulled into vufind. Once that's done I will pull the changes and I can then push my current code (which uses his updated API) to this PR.

@demiankatz
Copy link
Member

Thanks for the update. I believe I've merged everything in now -- let me know if you run into any problems.

@hackartisan
Copy link
Contributor Author

The merge caused conflicts and I thought it would be easier to squash it all into one commit on top of current master. I've created a new branch and I'll submit a PR from there. Hope that works for you; I'm not really sure there is an ideal way to deal with this situation (when time has elapsed and a lot has happened on master).

@demiankatz
Copy link
Member

That's totally fine -- I'll just mention #289 here in case anyone wants to follow the thread to the replacement PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants