Confirm behavior of multiterm queries that are not phrase queries #63
Comments
In that case, how is a multiterm query any different than a phrase query?
In a phrase query, the terms are found in sequence and each sequence counts as a single hit. If individual terms from that phrase appear alone elsewhere on the page, they are not highlighted and don't count as hits. ANDed terms don't have to be in sequence or even near each other on the page, but they must all be present on the page.
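To make the distinction above concrete, here is a minimal sketch in plain Python. It assumes naive whitespace tokenization with punctuation stripped; the real Solr analysis chain (stemming, for instance) is not modeled, and the function names are hypothetical.

```python
PUNCT = ".,;:!?\"'()"

def tokenize(text):
    """Lowercase, split on whitespace, strip surrounding punctuation."""
    stripped = (t.strip(PUNCT).lower() for t in text.split())
    return [t for t in stripped if t]

def and_match(page_text, terms):
    """True when every term occurs somewhere on the page, in any order."""
    tokens = set(tokenize(page_text))
    return all(t.lower() in tokens for t in terms)

def phrase_hits(page_text, terms):
    """Count places where the terms appear as one contiguous sequence."""
    tokens = tokenize(page_text)
    wanted = [t.lower() for t in terms]
    n = len(wanted)
    return sum(1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == wanted)

page = "Crimes were reported. Acts against all of humanity followed."
terms = ["crimes", "against", "humanity"]
print(and_match(page, terms))    # True: all three terms are on the page
print(phrase_hits(page, terms))  # 0: the terms never appear in sequence
```

A page like this one would be returned for the unquoted (ANDed) query but not for the quoted phrase query.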
I just posted an example of this in #55 today.
IIIF search response
without quotes
{
"@type": "search:Hit",
"annotations": [
"https://purl.stanford.edu/cc842mn9348/iiif/canvas/cc842mn9348_40/text/at/1660.33,1094.11,140.42,33.89",
"https://purl.stanford.edu/cc842mn9348/iiif/canvas/cc842mn9348_40/text/at/1824.15,1094.11,163.82,33.89"
],
"before": "11,46.81,33.89",
"after": "humanity,"
},
with quotes
{
"@type": "search:Hit",
"annotations": [
"https://purl.stanford.edu/cc842mn9348/iiif/canvas/cc842mn9348_40/text/at/1660.33,1094.11,140.42,33.89",
"https://purl.stanford.edu/cc842mn9348/iiif/canvas/cc842mn9348_40/text/at/1824.15,1094.11,163.82,33.89",
"https://purl.stanford.edu/cc842mn9348/iiif/canvas/cc842mn9348_40/text/at/2011.38,1094.11,210.62,33.89"
],
"before": "",
"after": ""
},
Solr response
without quotes
"ocrtext_en": [
"11,46.81,33.89 <em>crimes☞1660.33,1094.11,140.42,33.89</em> <em>against☞1824.15,1094.11,163.82,33.89</em> humanity,☞2011",
"89,44.90,32.48 <em>crimes☞944.31,1334.89,134.70,32.48</em> <em>against☞1123.92,1334.89,157.15,32.48</em> humanity,☞1303",
"78,44.90,32.48 <em>crimes☞1977.04,1529.78,134.70,32.48</em>\n<em>against☞674.90,1594.74,157.15,32.48</em> humanity,☞854.51",
"67,44.90,32.48 <em>crimes☞1505.58,1724.67,134.70,32.48</em> <em>against☞1662.73,1724.67,157.15,32.48</em> humanity,☞1842",
"56,44.90,32.48 <em>crime☞1595.38,1919.56,112.25,32.48</em> <em>against☞1730.08,1919.56,157.15,32.48</em> humanity,☞1909",
"00 the☞1058.94,573.00,65.32,33.00 <em>crimes☞1146.04,573.00,130.65,33.00</em> to☞1298.46,573.00,43.55,33.00 which☞1363"
],
with quotes
"ocrtext_en": [
"<em>crimes☞1660.33,1094.11,140.42,33.89 against☞1824.15,1094.11,163.82,33.89 humanity,☞2011.38,1094.11,210.62,33.89</em>",
"<em>crimes☞944.31,1334.89,134.70,32.48 against☞1123.92,1334.89,157.15,32.48 humanity,☞1303.52,1334.89,202.06,32.48</em>",
"<em>crimes☞1977.04,1529.78,134.70,32.48\nagainst☞674.90,1594.74,157.15,32.48 humanity,☞854.51,1594.74,202.06,32.48</em>",
"<em>crimes☞1505.58,1724.67,134.70,32.48 against☞1662.73,1724.67,157.15,32.48 humanity,☞1842.34,1724.67,202.06,32.48</em>",
"<em>crime☞1595.38,1919.56,112.25,32.48 against☞1730.08,1919.56,157.15,32.48 humanity,☞1909.69,1919.56,202.06,32.48</em>"
],
highlighting
Question: Why do we only get two highlights when searching without quotes?
[Screenshots: without quotes; with quotes]
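One way to see where the annotation counts come from is to parse the Solr snippets directly. The sketch below assumes each `<em>` span holds one or more `word☞x,y,w,h` tokens, which is the shape visible in the responses above; `highlighted_boxes` and both regular expressions are hypothetical, not the project's actual parser.

```python
import re

# Content of each <em>...</em> span (may contain a newline, as in one snippet).
EM = re.compile(r"<em>(.*?)</em>", re.DOTALL)
# One "word☞x,y,w,h" token: a word followed by four comma-separated numbers.
NUM = r"\d+(?:\.\d+)?"
TOKEN = re.compile(rf"(\S+?)☞({NUM},{NUM},{NUM},{NUM})")

def highlighted_boxes(snippet):
    """Return (word, "x,y,w,h") pairs for every token inside an <em> span."""
    boxes = []
    for span in EM.findall(snippet):
        boxes.extend(TOKEN.findall(span))
    return boxes

without_quotes = (
    "11,46.81,33.89 <em>crimes☞1660.33,1094.11,140.42,33.89</em> "
    "<em>against☞1824.15,1094.11,163.82,33.89</em> humanity,☞2011"
)
with_quotes = (
    "<em>crimes☞1660.33,1094.11,140.42,33.89 "
    "against☞1824.15,1094.11,163.82,33.89 "
    "humanity,☞2011.38,1094.11,210.62,33.89</em>"
)

print(len(highlighted_boxes(without_quotes)))  # 2
print(len(highlighted_boxes(with_quotes)))     # 3
```

Under that assumption, the unquoted snippet yields two boxes because only "crimes" and "against" sit inside `<em>` tags, which matches the two annotations in the IIIF response. It does not explain why "humanity," was left untagged in the first place.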
It seems like the behavior is slightly different with the UnifiedHighlighter referenced on #55 ...
@jkeck asked me for clarification around expected outcomes or acceptance criteria for this ticket. First, I think this is (for now) analysis to confirm behavior. We should look into the following:
I believe in order to accomplish "all unquoted terms must exist in a single canvas in order for a result to appear" we may need to set the
I can understand why this wouldn't normally be desirable in a discovery environment, but in the content search case (w/ an autocomplete available) perhaps this is what we would want.
I think in SearchWorks the mm applies over a threshold of 7 or 8 terms (can't remember where it is now). That is, below the threshold, all terms must be present for the item to be found; above the threshold, one or more terms may be missing. I think the same makes sense here. The examples I've seen quoted are all short, 3 or 4 terms, where 100% would be expected. Autocomplete adds a new wrinkle in that it changes user expectation. Google returns docs that don't have all the terms in the selected autocomplete query, but they indicate in the results that words are missing.
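The threshold rule described above can be sketched as a small function. The values here are assumptions for illustration only: the comment recalls "7 or 8" without confirming which, and the number of terms allowed to be missing is a guess. In Solr's DisMax-style `mm` syntax, a conditional spec such as `7<-1` is documented as "all clauses required up to 7; beyond that, all but one", which is the behavior mimicked here.

```python
def required_matches(num_terms, threshold=7, allowed_missing=1):
    """Number of query terms that must match.

    Mimics an mm spec like "7<-1": at or below the threshold every term
    is required; above it, `allowed_missing` terms may be absent.
    Both default values are assumptions, not the SearchWorks configuration.
    """
    if num_terms <= threshold:
        return num_terms
    return num_terms - allowed_missing

for n in (3, 7, 8, 10):
    print(n, required_matches(n))
```

With these assumed values, the short 3- or 4-term queries from this thread would require every term, which is the 100% behavior expected for content search.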
What do we want to do about matches that cross page boundaries (if anything)? For example, "crimes against humanity" exists as a phrase, but "crimes against" are the last words of one page and "humanity" is the first word of the next page.
In our case, if that did happen, "crimes against" would exist in Document A, and "humanity" would exist in Document B. Is there anything we could do about this?
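As one possible way to reason about the boundary case, the sketch below checks whether a phrase straddles two consecutive pages by joining the tail of one page with the head of the next. This is an illustration outside of Solr, not a proposal for how the index should work; `spans_boundary` is a hypothetical helper and tokenization is a naive whitespace split.

```python
def spans_boundary(page_a, page_b, phrase):
    """True when `phrase` starts on page_a and finishes on page_b."""
    wanted = phrase.lower().split()
    n = len(wanted)
    if n < 2:
        return False  # a single term cannot straddle a page break
    tail = page_a.lower().split()[-(n - 1):]  # last n-1 tokens of page A
    head = page_b.lower().split()[:n - 1]     # first n-1 tokens of page B
    window = tail + head
    # Neither side alone holds n tokens, so any match must cross the break.
    return any(window[i:i + n] == wanted for i in range(len(window) - n + 1))

page_a = "the tribunal considered crimes against"
page_b = "humanity and related offences"
print(spans_boundary(page_a, page_b, "crimes against humanity"))  # True
```

Because the window holds at most `n - 1` tokens from each page, a hit in the window cannot be a phrase that already sits entirely on one page.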
Are the documents indexed in the same way in the Spotlight search? Each page individually? I'm wondering if we're setting up a mismatch in behaviour between Spotlight and the viewer.
The documents are indexed at the page/canvas level in content search because it is "search within this document" and the level of discovery is a canvas (e.g. I want to enter a search query and be returned pages that have that term on them). Exhibits/Spotlight is search across as opposed to search within, so the indexing is done at the document level. I'm not sure how we meet both these requirements simultaneously:
These seem to be opposite requirements to me.
So Spotlight is searching the document for the existence of three terms: A B C. They are ANDed, so must all be present in the document. Doc may be returned if A is on page 4, B on page 27, C on pages 3 and 15. Move to the viewer, where user will expect to find the three terms in the document. They do the same search, they get no hits, because all three terms are not present on any one page. That seems like a problem. @ggeisler, your thoughts? I think both requirements above could be wrong for different reasons.
In the context of the viewer only, it technically makes sense, since the page is the unit of discovery and the unit of discovery should, in theory, include all ANDed terms. But it doesn't make sense in the overall flow where the user will find a document for their query, then potentially find no instances of that query within the document.
Then we are essentially treating AND as a phrase. With an AND search you can reasonably expect other words to fall between the terms. How much proximity do we require to highlight terms across 2 pages? Yeah, these can't both be true.
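The mismatch discussed above (document-level AND in Spotlight versus page-level AND in the viewer) can be reproduced with a toy model. The page numbers follow the A/B/C example in this thread; the data structure and function names are hypothetical.

```python
def document_has_all(pages, terms):
    """Document-level AND: each term may come from a different page."""
    seen = set()
    for words in pages.values():
        seen.update(words)
    return all(t in seen for t in terms)

def pages_with_all(pages, terms):
    """Page-level AND: a page is a hit only if it holds every term."""
    return sorted(p for p, words in pages.items() if all(t in words for t in terms))

# Term A on page 4, B on page 27, C on pages 3 and 15.
pages = {
    3: {"c"},
    4: {"a"},
    15: {"c"},
    27: {"b"},
}
terms = ["a", "b", "c"]

print(document_has_all(pages, terms))  # True: the document is found
print(pages_with_all(pages, terms))    # []: no single page is a hit
```

The same query finds the document at the document level and then returns nothing inside it at the page level, which is the confusing flow described in the comments.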
I agree it's not an ideal situation and is likely to cause confusion for the user who enters the same multiple terms in both Spotlight search and the viewer search. Given we're using a different unit of discovery in the two contexts, I don't have any ideas for a good solution.
I'm not 100% sure if this is possible, but I wonder if we (in discovery environments) could index each canvas as a separate value of a multi-valued field, then configure Solr to only consider returning results/highlights for hits within a single value of that field, and not for terms spread across multiple values. Are we at the point that we need to do some more analysis of the desired behavior? I feel like we're beginning to define it, but I'm not sure it has been explicitly documented anywhere (or maybe that exists somewhere, and we can use that to generate some acceptance criteria).
From 2/20 sprint planning: the work that remains is fixing the bug that @aeschylus and @jkeck identified, and exploring the implementation of the unified highlighter.
When I have a multiterm query where terms are separated by spaces that is not a phrase query wrapped in quotes, page-level matches include cases where all the terms are matched (i.e., the terms should be ANDed together).
Discussed w/ @jvine and @ggeisler on 1/31.